Load Libraries

Data Exploration

In this section, we perform our exploratory analysis to understand our dataset and identify potential issues such as missing values and outliers

Loading Datasets

First, we load in our data

eval_data <- read.csv("moneyball-evaluation-data.csv")
train_data <- read.csv("moneyball-training-data.csv")

head(eval, n=3)
##                                                                          
## 1 function (expr, envir = parent.frame(), enclos = if (is.list(envir) || 
## 2     is.pairlist(envir)) parent.frame() else baseenv())                 
## 3 .Internal(eval(expr, envir, enclos))
head(train, n=3)
##                         
## 1 function (x, ...)     
## 2 {                     
## 3     UseMethod("train")
# convert to data.frame and drop INDEX fields
eval_df <- data.frame(eval_data) |>
  drop_columns("INDEX")
train_df <- data.frame(train_data) |>
  drop_columns("INDEX")

Understanding the TARGET_WINS Field in the Moneyball Dataset:

The TARGET_WINS field represents the number of games a team won in a given baseball season.

This is the dependent variable (target variable) in the multiple linear regression model.

We will be trying to predict TARGET_WINS based on other team performance metrics.

Viewing Field Values:

We can see there are fields with null values

str(eval_df)
## 'data.frame':    259 obs. of  15 variables:
##  $ TEAM_BATTING_H  : int  1209 1221 1395 1539 1445 1431 1430 1385 1259 1397 ...
##  $ TEAM_BATTING_2B : int  170 151 183 309 203 236 219 158 177 212 ...
##  $ TEAM_BATTING_3B : int  33 29 29 29 68 53 55 42 78 42 ...
##  $ TEAM_BATTING_HR : int  83 88 93 159 5 10 37 33 23 58 ...
##  $ TEAM_BATTING_BB : int  447 516 509 486 95 215 568 356 466 452 ...
##  $ TEAM_BATTING_SO : int  1080 929 816 914 416 377 527 609 689 584 ...
##  $ TEAM_BASERUN_SB : int  62 54 59 148 NA NA 365 185 150 52 ...
##  $ TEAM_BASERUN_CS : int  50 39 47 57 NA NA NA NA NA NA ...
##  $ TEAM_BATTING_HBP: int  NA NA NA 42 NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  1209 1221 1395 1539 3902 2793 1544 1626 1342 1489 ...
##  $ TEAM_PITCHING_HR: int  83 88 93 159 14 20 40 39 25 62 ...
##  $ TEAM_PITCHING_BB: int  447 516 509 486 257 420 613 418 497 482 ...
##  $ TEAM_PITCHING_SO: int  1080 929 816 914 1123 736 569 715 734 622 ...
##  $ TEAM_FIELDING_E : int  140 135 156 124 616 572 490 328 226 184 ...
##  $ TEAM_FIELDING_DP: int  156 164 153 154 130 105 NA 104 132 145 ...
##  - attr(*, ".internal.selfref")=<externalptr>
str(train_df)
## 'data.frame':    2276 obs. of  16 variables:
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : int  842 1075 917 922 920 973 1062 1027 922 827 ...
##  $ TEAM_BASERUN_SB : int  NA 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : int  NA 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: int  5456 1082 917 928 920 973 1062 1033 922 827 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: int  NA 155 153 156 168 149 186 136 169 159 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Faceted Scatter Plot with Linear Regression Lines:

These scatter plots give us a sense of the relationship between the each variable and TARGET_WINS. Data points are plotted where the x-axis represents the predictor variable, and the y-axis represents the number of wins. A black trend line is fitted using linear regression is showing the general direction of the relationship.

train_df %>%
  gather(variable, value, -TARGET_WINS) %>%
  ggplot(., aes(value, TARGET_WINS)) + 
  geom_point(fill = "#628B3A", color="#628B3A")  + 
  geom_smooth(method = "lm", se = FALSE, color = "black") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = "Wins")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 3478 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 3478 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpretation

The slope of the regression line in each facet is used to determine the strength of relationship between the independent variable represent on the x-axis vs the dependent variable y (TARGET_WINS). The steeper the slope, the stronger the relationship is between the two variables. The direction of the slope tells whether the relationship is positive or negative: - if the line is sloped to the right, it is a positive relationship meaning we can expect an increase in y as x increases - if the line is sloped to the left, it is a negative relationship meaning we can expect an decrease in y as x increases - if the trend line is flat, there is likely no meaningful relationship between that variable and TARGET_WINS.

If the points are closely clustered around the line, it suggests a stronger linear relationship. If the points are widely scattered, the variable may not strongly predict TARGET_WINS.

Positive Predictors of Wins: TEAM_BATTING_2B, TEAM_BATTING_BB, TEAM_BATTING_H, TEAM_PITCHING_BB (unexpected).

Negative Predictors of Wins: TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_PITCHING_SO.

Weak or No Influence: TEAM_BATTING_3B, TEAM_BATTING_HBP, TEAM_BATTING_SO, TEAM_FIELDING_DP, TEAM_PITCHING_HR.

Sorted Correlations

Correlation Between All Variables

plot_correlation(train_df, type = "all")
## Warning: Removed 150 rows containing missing values or values outside the scale range
## (`geom_text()`).

Interpretration

The Correlation Heatmap shows that the following variables are very highly correlated with one another - TEAM_BATTING_HR and TEAM_PITCHING_HR have a correlation value of 0.97 - TEAM_FIELDING_E and TEAM_PITCHING_H have a correlation value of 0.67 - TEAM_FIELDING_E and TEAM_BATTING_BB have a correlation value of -0.66 - TEAM_BATTING_3B and TEAM_BATTING_HR have a correlation value of -0.64 - TEAM_FIELDING_E and TEAM_BATTING_HR have a correlation value of -0.59 - TEAM_FIELDING_E and TEAM_PITCHING_HR have a correlation value of -0.57 - TEAM_BATTING_H and TEAM_BATTING_2B have a correlation value of -0.57 - TEAM_FIELDING_E and TEAM_BATTING_3B have a correlation value of 0.51 -

Correlation Between TARGET_WINS and [x]

correlation_with_target <- cor(train_df, use = "complete.obs")["TARGET_WINS", ] %>%
  sort(decreasing = TRUE)  # Sort from highest to lowest correlation
print(correlation_with_target)
##      TARGET_WINS  TEAM_PITCHING_H   TEAM_BATTING_H  TEAM_BATTING_BB 
##       1.00000000       0.47123431       0.46994665       0.46868793 
## TEAM_PITCHING_BB TEAM_PITCHING_HR  TEAM_BATTING_HR  TEAM_BATTING_2B 
##       0.46839882       0.42246683       0.42241683       0.31298400 
## TEAM_BATTING_HBP  TEAM_BASERUN_SB  TEAM_BATTING_3B  TEAM_BASERUN_CS 
##       0.07350424       0.01483639      -0.12434586      -0.17875598 
## TEAM_FIELDING_DP  TEAM_BATTING_SO TEAM_PITCHING_SO  TEAM_FIELDING_E 
##      -0.19586601      -0.22889273      -0.22936481      -0.38668800
library(ggplot2)

correlation_data <- data.frame(Variable = names(correlation_with_target), Correlation = correlation_with_target)

ggplot(correlation_data, aes(x = reorder(Variable, Correlation), y = Correlation, fill = Correlation > 0)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # Flip for better readability
  labs(title = "Correlation with Target Wins", x = "Variables", y = "Correlation") +
  scale_fill_manual(values = c("red", "blue")) +
  theme_minimal()

### Checking Correlations and Finding Action:

Strong positive correlation (> 0.5) -> keep the variable in regression.

Strong negative correlation (< -0.5) -> Keep (inverse relationship).

Weak correlation (~0) -> Consider removing.

Two highly correlated variables (> 0.85) -> Drop one to avoid multicollinearity.

Summary of Datasets

skim(eval_df)
Data summary
Name eval_df
Number of rows 259
Number of columns 15
_______________________
Column type frequency:
numeric 15
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
TEAM_BATTING_H 0 1.00 1469.39 150.66 819 1387.0 1455.0 1548.00 2170 ▁▂▇▁▁
TEAM_BATTING_2B 0 1.00 241.32 49.52 44 210.0 239.0 278.50 376 ▁▂▇▇▂
TEAM_BATTING_3B 0 1.00 55.91 27.14 14 35.0 52.0 72.00 155 ▇▇▃▁▁
TEAM_BATTING_HR 0 1.00 95.63 56.33 0 44.5 101.0 135.50 242 ▆▅▇▃▁
TEAM_BATTING_BB 0 1.00 498.96 120.59 15 436.5 509.0 565.50 792 ▁▁▅▇▁
TEAM_BATTING_SO 18 0.93 709.34 243.11 0 545.0 686.0 912.00 1268 ▁▃▇▇▂
TEAM_BASERUN_SB 13 0.95 123.70 93.39 0 59.0 92.0 151.75 580 ▇▃▁▁▁
TEAM_BASERUN_CS 87 0.66 52.32 23.10 0 38.0 49.5 63.00 154 ▂▇▃▁▁
TEAM_BATTING_HBP 240 0.07 62.37 12.71 42 53.5 62.0 67.50 96 ▃▇▅▁▁
TEAM_PITCHING_H 0 1.00 1813.46 1662.91 1155 1426.5 1515.0 1681.00 22768 ▇▁▁▁▁
TEAM_PITCHING_HR 0 1.00 102.15 57.65 0 52.0 104.0 142.50 336 ▇▇▆▁▁
TEAM_PITCHING_BB 0 1.00 552.42 172.95 136 471.0 526.0 606.50 2008 ▆▇▁▁▁
TEAM_PITCHING_SO 18 0.93 799.67 634.31 0 613.0 745.0 938.00 9963 ▇▁▁▁▁
TEAM_FIELDING_E 0 1.00 249.75 230.90 73 131.0 163.0 252.00 1568 ▇▁▁▁▁
TEAM_FIELDING_DP 31 0.88 146.06 25.88 69 131.0 148.0 164.00 204 ▁▂▇▇▂
skim(train_df)
Data summary
Name train_df
Number of rows 2276
Number of columns 16
_______________________
Column type frequency:
numeric 16
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
TARGET_WINS 0 1.00 80.79 15.75 0 71.0 82.0 92.00 146 ▁▁▇▅▁
TEAM_BATTING_H 0 1.00 1469.27 144.59 891 1383.0 1454.0 1537.25 2554 ▁▇▂▁▁
TEAM_BATTING_2B 0 1.00 241.25 46.80 69 208.0 238.0 273.00 458 ▁▆▇▂▁
TEAM_BATTING_3B 0 1.00 55.25 27.94 0 34.0 47.0 72.00 223 ▇▇▂▁▁
TEAM_BATTING_HR 0 1.00 99.61 60.55 0 42.0 102.0 147.00 264 ▇▆▇▅▁
TEAM_BATTING_BB 0 1.00 501.56 122.67 0 451.0 512.0 580.00 878 ▁▁▇▇▁
TEAM_BATTING_SO 102 0.96 735.61 248.53 0 548.0 750.0 930.00 1399 ▁▆▇▇▁
TEAM_BASERUN_SB 131 0.94 124.76 87.79 0 66.0 101.0 156.00 697 ▇▃▁▁▁
TEAM_BASERUN_CS 772 0.66 52.80 22.96 0 38.0 49.0 62.00 201 ▃▇▁▁▁
TEAM_BATTING_HBP 2085 0.08 59.36 12.97 29 50.5 58.0 67.00 95 ▂▇▇▅▁
TEAM_PITCHING_H 0 1.00 1779.21 1406.84 1137 1419.0 1518.0 1682.50 30132 ▇▁▁▁▁
TEAM_PITCHING_HR 0 1.00 105.70 61.30 0 50.0 107.0 150.00 343 ▇▇▆▁▁
TEAM_PITCHING_BB 0 1.00 553.01 166.36 0 476.0 536.5 611.00 3645 ▇▁▁▁▁
TEAM_PITCHING_SO 102 0.96 817.73 553.09 0 615.0 813.5 968.00 19278 ▇▁▁▁▁
TEAM_FIELDING_E 0 1.00 246.48 227.77 65 127.0 159.0 249.25 1898 ▇▁▁▁▁
TEAM_FIELDING_DP 286 0.87 146.39 26.23 52 131.0 149.0 164.00 228 ▁▂▇▆▁
summary(train_df)
##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##                                                                  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##                                   NA's   :102      NA's   :131    
##  TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
##  Min.   :  0.0   Min.   :29.00    Min.   : 1137   Min.   :  0.0   
##  1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419   1st Qu.: 50.0   
##  Median : 49.0   Median :58.00    Median : 1518   Median :107.0   
##  Mean   : 52.8   Mean   :59.36    Mean   : 1779   Mean   :105.7   
##  3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682   3rd Qu.:150.0   
##  Max.   :201.0   Max.   :95.00    Max.   :30132   Max.   :343.0   
##  NA's   :772     NA's   :2085                                     
##  TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0   
##  1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:131.0   
##  Median : 536.5   Median :  813.5   Median : 159.0   Median :149.0   
##  Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :146.4   
##  3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:164.0   
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0   
##                   NA's   :102                        NA's   :286
summary(eval_df)
##  TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B  TEAM_BATTING_HR 
##  Min.   : 819   Min.   : 44.0   Min.   : 14.00   Min.   :  0.00  
##  1st Qu.:1387   1st Qu.:210.0   1st Qu.: 35.00   1st Qu.: 44.50  
##  Median :1455   Median :239.0   Median : 52.00   Median :101.00  
##  Mean   :1469   Mean   :241.3   Mean   : 55.91   Mean   : 95.63  
##  3rd Qu.:1548   3rd Qu.:278.5   3rd Qu.: 72.00   3rd Qu.:135.50  
##  Max.   :2170   Max.   :376.0   Max.   :155.00   Max.   :242.00  
##                                                                  
##  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB TEAM_BASERUN_CS 
##  Min.   : 15.0   Min.   :   0.0   Min.   :  0.0   Min.   :  0.00  
##  1st Qu.:436.5   1st Qu.: 545.0   1st Qu.: 59.0   1st Qu.: 38.00  
##  Median :509.0   Median : 686.0   Median : 92.0   Median : 49.50  
##  Mean   :499.0   Mean   : 709.3   Mean   :123.7   Mean   : 52.32  
##  3rd Qu.:565.5   3rd Qu.: 912.0   3rd Qu.:151.8   3rd Qu.: 63.00  
##  Max.   :792.0   Max.   :1268.0   Max.   :580.0   Max.   :154.00  
##                  NA's   :18       NA's   :13      NA's   :87      
##  TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
##  Min.   :42.00    Min.   : 1155   Min.   :  0.0    Min.   : 136.0  
##  1st Qu.:53.50    1st Qu.: 1426   1st Qu.: 52.0    1st Qu.: 471.0  
##  Median :62.00    Median : 1515   Median :104.0    Median : 526.0  
##  Mean   :62.37    Mean   : 1813   Mean   :102.1    Mean   : 552.4  
##  3rd Qu.:67.50    3rd Qu.: 1681   3rd Qu.:142.5    3rd Qu.: 606.5  
##  Max.   :96.00    Max.   :22768   Max.   :336.0    Max.   :2008.0  
##  NA's   :240                                                       
##  TEAM_PITCHING_SO TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   :   0.0   Min.   :  73.0   Min.   : 69.0   
##  1st Qu.: 613.0   1st Qu.: 131.0   1st Qu.:131.0   
##  Median : 745.0   Median : 163.0   Median :148.0   
##  Mean   : 799.7   Mean   : 249.7   Mean   :146.1   
##  3rd Qu.: 938.0   3rd Qu.: 252.0   3rd Qu.:164.0   
##  Max.   :9963.0   Max.   :1568.0   Max.   :204.0   
##  NA's   :18                        NA's   :31

Calculate mean, median, and standard deviation for key numerical columns

train_df %>%
  summarise(across(where(is.numeric), list(mean = mean, median = median, sd = sd), na.rm = TRUE))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(...)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))

The summarise(across()) function above is calculating the mean, median, ans standard deviation for all the numerical variables.

Analysis of the training Dataset Before Fitting a Multiple Linear Regression Model:

Before fitting a multiple linear regression (MLR) model, we analyze the dataset for potential issues such as missing values, extreme outliers, multicollinearity, and variable distributions.

Here’s what we can infer from the summary statistics:

1. Understanding the Target Variable (TARGET_WINS)

Range: 0 to 146 wins

Mean: ~80.79 wins

Median: 82 wins

Distribution: The min value of 0 and max of 146 suggest some potential outliers or erroneous data points, since most teams win between 50-110 games in a season.

boxplot(train_df$TARGET_WINS, main="Distribution of Team Wins", ylab="Wins", col="lightblue")

Actionable Steps:

  • Drop row with zero (0) value.

2. Missing Values

Some variables have a significant number of missing values. In particular:

  • TEAM_BATTING_SO (102 missing values)
  • TEAM_BASERUN_SB (131 missing values)
  • TEAM_BASERUN_CS (772 missing values) <- potentially unreliable
  • TEAM_BATTING_HBP (2085 missing values) <- very unreliable
  • TEAM_PITCHING_SO (102 missing values)
  • TEAM_FIELDING_DP (286 missing values)

Additionally, four variables have values of zero (0) reported that appear suspicious. In particular: - TEAM_BATTING_SO & TEAM_PITHCING_SO have the same rows entered as zero suggesting that data may not have been available for these entries. - TEAM_BATTING_HR & TEAM_PITHCING_HR have the same rows entered as zero suggesting that data may not have been available for these entries.

Actionable Steps:

  • Impute missing values (using mean/median or regression techniques).

  • Consider removing TEAM_BATTING_HBP and TEAM_BASERUN_CS if they are highly incomplete and do not contribute much.

3. Potential Outliers & Data Issues

Several variables have extreme max values that seem unrealistic. Specifically:

  • TEAM_PITCHING_H (Max = 30,132) <- Likely an error since typical values range from 1,200 - 1,700.

  • TEAM_PITCHING_SO (Max = 19,278) <- Suspiciously high (typical range: 500 - 1,500).

  • TEAM_PITCHING_BB (Max = 3,645) <- Very high (typical range: 300 - 700).

  • TEAM_FIELDING_E (Max = 1,898) <- Likely an error since the normal range is ~ 70-200.

Actionable Steps:

  • Check for data entry errors.

  • Remove extreme outliers if they distort model performance.

4. Feature Selection Considerations:

Batting Variables: TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB are likely strong predictors of team wins. Note that TEAM_BATTING_H includes TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR data.

Pitching Variables: TEAM_PITCHING_H, TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO will impact defensive strength.

Fielding Variables: TEAM_FIELDING_E (errors) and TEAM_FIELDING_DP (double plays) may have a weaker impact compared to batting and pitching.

Actionable Steps:

TEAM_PITCHING_H, TEAM_PITCHING_HR, and TEAM_PITCHING_BB may be highly correlated, which can cause multicollinearity in the regression model.

TEAM_BATTING_H, TEAM_BATTING_2B, and TEAM_BATTING_3B may also be strongly correlated since total hits include doubles and triples.

5. Data Cleaning Recommendations Before Fitting the Model

Handle Missing Values

Consider dropping or imputing variables with too many missing values (e.g., TEAM_BATTING_HBP).

Impute TEAM_BASERUN_SB, TEAM_BASERUN_CS, and TEAM_FIELDING_DP appropriately.

  • Remove or Adjust Extreme Outliers

Remove highly unrealistic values in pitching, fielding, and errors.

  • Check for Multicollinearity

Use Variance Inflation Factor (VIF) to detect multicollinearity and drop redundant features.

  • Feature Engineering

Consider derived metrics like batting average (H/AB), on-base percentage (OBP), or earned run average (ERA) instead of raw counts.

Conclusion

The dataset contains inconsistencies, missing values, and extreme outliers that need to be addressed before fitting an MLR model.

Once cleaned, feature selection and multicollinearity checks will be essential to ensure a robust and interpretable model for predicting team wins.

Analysis of the Box Plot for TARGET_WINS (Team Wins Distribution)

This box plot provides valuable insights into the distribution of team wins in the training dataset.

Here’s what we can infer:

1. Median and Spread of Wins

The thick horizontal line inside the box represents the median (~82 wins).

The box itself (Interquartile Range - IQR) shows the middle 50% of the data, which seems to range roughly from 70 to 92 wins.

2. Box Plot – for Outlier Detection & Distribution Analysis:

  • Low-end outliers (~0-40 wins): There are several small circles (outliers) below the lower whisker.

  • High-end outliers (~110-146 wins): There are some outliers above the upper whisker, but visually fewer than the low-end.

Consideration for Regression?

  • These low-win teams might be problematic for modeling because they could represent incomplete or missing data.

  • Potential data entry issues (e.g., a team with 0 wins) should be checked.

  • If extreme values skew the regression, we might need transformations (log scaling)

3. Data Skewness and Symmetry

The box is fairly centered, suggesting a roughly symmetric distribution.

hist(train_df$TARGET_WINS, 
     main = "Histogram of Team Wins with Density Curve", 
     xlab = "Wins", 
     col = "lightgray", 
     border = "black", 
     breaks = 20, 
     probability = TRUE)  # Converts y-axis to density

lines(density(train_df$TARGET_WINS, na.rm = TRUE), 
      col = "red", 
      lwd = 2)  # Adds a red density curve

Data Preparation

missing_values <- train_df %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count")

print(missing_values)
## # A tibble: 16 × 2
##    Variable         Missing_Count
##    <chr>                    <int>
##  1 TARGET_WINS                  0
##  2 TEAM_BATTING_H               0
##  3 TEAM_BATTING_2B              0
##  4 TEAM_BATTING_3B              0
##  5 TEAM_BATTING_HR              0
##  6 TEAM_BATTING_BB              0
##  7 TEAM_BATTING_SO            102
##  8 TEAM_BASERUN_SB            131
##  9 TEAM_BASERUN_CS            772
## 10 TEAM_BATTING_HBP          2085
## 11 TEAM_PITCHING_H              0
## 12 TEAM_PITCHING_HR             0
## 13 TEAM_PITCHING_BB             0
## 14 TEAM_PITCHING_SO           102
## 15 TEAM_FIELDING_E              0
## 16 TEAM_FIELDING_DP           286
ggplot(missing_values, aes(y = reorder(Variable, Missing_Count), x = Missing_Count, fill = Missing_Count > 0)) +
  geom_col() +
  labs(title = "Missing Values in Training Dataset",
       x = "Number of Missing Values",
       y = "Variables") +
  scale_fill_manual(values = c("gray", "red"), labels = c("No Missing", "Has Missing")) +
  theme_minimal()

### Strategy for Missing Values Column wise:

There are four main options for handling missing values:

1. Remove Rows with Missing Target (TARGET_WINS)

If the TARGET_WINS column has missing values, remove those rows since we can’t predict missing outcomes.

train_df <- train_df %>% filter(!is.na(TARGET_WINS))

2. Removing Columns with Too Many Missing Values

If a column has too many missing values (e.g., >50% missing), it may be better to remove it.

The TEAM_BATTING_HBP column is mostly empty and not critical.

train_df <- train_df[, !names(train_df) %in% "TEAM_BATTING_HBP"]

3. Address Suspicious Values

The variables TEAM_BATTING_SO, TEAM_PITCHING_SO, TEAM_BATTING_HR, and TEAM_PITCHING_HR included several observations with a value of zero. As the rows were the same for both variables, it appears that these may also be missing observations as the likelihood that a team’s batters did not have a single strikeout nor did their pitchers pitch a single strikeout over the course of a 162 game season is highly unlikely. We will therefore treat these as missing observations and impute using the median value which is more robust to outliers then using the mean values. We will also drop the single row where the team did not win a single game, as this is also suspicious.

# Convert dubious stats to NAs for pitching
# and drop unnecessary columns
train_df <- train_df |>
  mutate(
    TEAM_BATTING_SO = if_else(TEAM_BATTING_SO > 0, TEAM_BATTING_SO, NA_integer_),
    TEAM_PITCHING_SO = if_else(TEAM_PITCHING_SO > 0, TEAM_PITCHING_SO, NA_integer_),
    TEAM_BATTING_HR = if_else(TEAM_BATTING_HR > 0, TEAM_BATTING_HR, NA_integer_),
    TEAM_PITCHING_HR = if_else(TEAM_PITCHING_HR > 0, TEAM_PITCHING_HR, NA_integer_)
  ) |>
  filter(TARGET_WINS > 0) 

Handling Missing Values

Next, we will address the remaining missing values. We will weight several options

Option 1: Removing Missing Values

  • Missing Values Cause Errors in Regression Models

  • lm() in R cannot handle missing values and will return an error if NAs exist in predictor variables.

  • Removing missing values ensures that the model runs smoothly without interruptions.

Pros

  • Convenience. If too many rows have missing values, R removes them automatically, reducing sample size.

Cons

  • If missing values are not randomly distributed, removing them may bias the dataset.

  • Reduced sample size

Conclusion

Instead of na.omit(), imputation methods (like mean/median filling) may be better for handling missing data.

Removing rows with missing data is sometimes not the best approach, especially if a large portion of data is lost. We will therefore explore other methods.

Option 2: Mean

Instead of removing missing values, fill them with the mean:

train_df_mean <- train_df
mean_val <- colMeans(train_df, na.rm = TRUE)

# Impute using Means
for(i in colnames(train_df))
    train_df[,i][is.na(train_df[,i])] <- mean_val[i]

plot_qq(train_df, sampled_rows = 1000L)

# Missing values per column
colSums(is.na(train_df_mean))
##      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
##                0                0                0                0 
##  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##               14                0              121              131 
##  TEAM_BASERUN_CS  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB 
##              772                0               14                0 
## TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##              121                0              285

Pros

Cons Not usually recommended for serious analyses because it reduces the variance and doesn’t account for the uncertainty in imputations.

Option 3: Median

Instead of removing missing values, fill them with the median:

library(miscTools)

train_df_median <- train_df
median_val <- colMedians(train_df, na.rm = TRUE)

# Impute using medians
for(i in colnames(train_df))
    train_df_median[,i][is.na(train_df_median[,i])] <- as.integer(median_val[i])

plot_qq(train_df_median, sampled_rows = 1000L)

# Missing values per column
colSums(is.na(train_df_median))
##      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
##                0                0                0                0 
##  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##                0                0                0                0 
##  TEAM_BASERUN_CS  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB 
##                0                0                0                0 
## TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##                0                0                0

Pros

Cons

Option 4: Multivariate Imputation by Chained Equations (MICE)

Multivariate Imputation by Chained Equations (MICE) is a type of Multiple Imputation.

train_df_mice <- mice(data=train_df, m=30, maxit=10, seed=12345, print=FALSE)
train_df_clean <- train_df_mice$data
sum(is.na(train_df_clean))  # Total missing values
## [1] 0

Pros

Cons

Splitting the training dataset

We split the original training dataset into a training and testing dataset in order to test the strength of our model without testing the model on data that the model has already seen. The training dataset will hold 75% of our original observations, while the test dataset will hold 25%.

smp_size <- floor(0.75 * nrow(train_df_clean))
 nrow(train_df_clean)
## [1] 2275
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(train_df_clean)), size = smp_size)

stp75_train_df <- train_df_clean[train_ind, ]
stp25_test_df <- train_df_clean[-train_ind, ]

Outliers

The Residuals vs. Fitted and QQ Plots show a fairly linear pattern, while Scale-Location plot suggest Homoscedasticity. However, the Residuals vs Leverage plot reveals the presence of some outliers.

stp_model_full <- lm(TARGET_WINS ~ ., data = stp75_train_df)
summary(stp_model_full)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = stp75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.316  -8.649   0.045   8.329  57.170 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      17.6962800  6.3050518   2.807 0.005063 ** 
## TEAM_BATTING_H    0.0514744  0.0044700  11.516  < 2e-16 ***
## TEAM_BATTING_2B  -0.0224439  0.0106248  -2.112 0.034798 *  
## TEAM_BATTING_3B   0.0672376  0.0199897   3.364 0.000786 ***
## TEAM_BATTING_HR   0.0285624  0.0317578   0.899 0.368578    
## TEAM_BATTING_BB   0.0066057  0.0067404   0.980 0.327224    
## TEAM_BATTING_SO  -0.0034806  0.0029004  -1.200 0.230288    
## TEAM_BASERUN_SB   0.0278236  0.0051813   5.370 8.96e-08 ***
## TEAM_BASERUN_CS  -0.0136696  0.0181769  -0.752 0.452135    
## TEAM_PITCHING_H  -0.0008407  0.0004833  -1.740 0.082111 .  
## TEAM_PITCHING_HR  0.0239234  0.0289737   0.826 0.409095    
## TEAM_PITCHING_BB  0.0018163  0.0047792   0.380 0.703961    
## TEAM_PITCHING_SO  0.0021152  0.0009863   2.144 0.032138 *  
## TEAM_FIELDING_E  -0.0223272  0.0028410  -7.859 6.84e-15 ***
## TEAM_FIELDING_DP -0.1065083  0.0149272  -7.135 1.43e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.07 on 1691 degrees of freedom
## Multiple R-squared:  0.3031, Adjusted R-squared:  0.2973 
## F-statistic: 52.53 on 14 and 1691 DF,  p-value: < 2.2e-16
# check for outliers using cooks-distance plot
plot(stp_model_full, which = 4,  id.n = 8)

# get points of influence
influence <- influence.measures(stp_model_full)
influential_points <- influence$infmat
cooks_d <- influence$infmat[, "cook.d"]
max_influence_index <- which.max(cooks_d)

The observation with index 2135 is particularly problematic. A closer examination reveals TEAM_PITCHING_SO is almost 75% higher as the next highest value (19278 vs 12758). TEAM_PITCHING_H is also unusually high for this year.

influential_data_point <- stp75_train_df[max_influence_index, ]
print(influential_data_point)
##      TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 2135          41            992             263              20        100.2729
##      TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 2135             142             952          124.82        52.83899
##      TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 2135           20088         106.3998             2876            19278
##      TEAM_FIELDING_E TEAM_FIELDING_DP
## 2135             952         146.3879

We will drop this row since it has such a high leverage on the model.

# remove outlier 
stp75_train_df <- stp75_train_df |>
  filter(TEAM_PITCHING_SO < 15000)

# confirm outliers
stp_model_full <- lm(TARGET_WINS ~ ., data = stp75_train_df)
plot(stp_model_full, which = 4,  id.n = 8)

Charting the plot shows another point (202) with high influence. This record has a TEAM_PITCHING_SO that is more than twice the next closest value.

# get points of influence
influence <- influence.measures(stp_model_full)
influential_points <- influence$infmat
cooks_d <- influence$infmat[, "cook.d"]
max_influence_index <- which.max(cooks_d)
influential_data_point <- stp75_train_df[max_influence_index, ]
print(influential_data_point)
##     TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 202         108           1188             338               0        100.2729
##     TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 202             270             945          124.82        52.83899
##     TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 202           16038         106.3998             3645            12758
##     TEAM_FIELDING_E TEAM_FIELDING_DP
## 202             716         146.3879
# remove outlier at row 202
stp75_train_df <- stp75_train_df[-c(202), ]

# confirm outliers
stp_model_full <- lm(TARGET_WINS ~ ., data = stp75_train_df)

plot(stp_model_full, which = 4,  id.n = 8)

As the Cooks Distance (Di) value is less than 0.5 for all of the remaining outliers appear, they are not significantly influential and can be left in our dataset.

Building Three Multiple Linear Regression Models with Manual Variable Selection

Model 1: Base Baseball Stats Model

  • This model includes fundamental offensive, defensive, and pitching stats that logically contribute to wins.

Why were these variables selected?

TEAM_BATTING_H (Hits): More hits increase the chances of scoring.

TEAM_BATTING_HR (Home Runs): Home runs are a major contributor to runs.

TEAM_PITCHING_SO (Strikeouts): More strikeouts reduce opponent scoring.

TEAM_FIELDING_E (Errors): More errors lead to more opponent runs (negative predictor).

Why exclude some variables?

TEAM_BASERUN_SB (Stolen Bases): Limited impact on overall wins.

TEAM_PITCHING_BB (Walks Allowed): May not be as predictive when combined with strikeouts.

Model Summary Statistics

base_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_PITCHING_SO + TEAM_FIELDING_E, data = stp75_train_df)

# View model summary
summary(base_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E, data = stp75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.067  -9.426   0.048   9.562  50.510 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      12.5084602  4.3458592   2.878  0.00405 ** 
## TEAM_BATTING_H    0.0501497  0.0028185  17.793  < 2e-16 ***
## TEAM_BATTING_HR  -0.0007802  0.0077359  -0.101  0.91968    
## TEAM_PITCHING_SO -0.0001984  0.0014372  -0.138  0.89021    
## TEAM_FIELDING_E  -0.0215516  0.0020224 -10.656  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.67 on 1699 degrees of freedom
## Multiple R-squared:  0.2297, Adjusted R-squared:  0.2279 
## F-statistic: 126.7 on 4 and 1699 DF,  p-value: < 2.2e-16

Interpreting the Coefficients of the Regression Model

1- Intercept (8.05)

When all predictor variables (TEAM_BATTING_H, TEAM_BATTING_HR, TEAM_PITCHING_SO, TEAM_FIELDING_E) are zero, a team is expected to have 8.05 wins. Significant (p = 0.04) → The intercept is meaningful in this context.

2- TEAM_BATTING_H (Hits) → 0.05 (p < 2e-16 *)

positive & highly significant Interpretation: For every additional hit, the team is expected to win 0.052 more games. Example: If a team gets 100 more hits in a season, they would be expected to win ~5 more games (100 × 0.05). Conclusion: More hits lead to more wins, which is expected in baseball.

3- TEAM_BATTING_HR (Home Runs) → -0.008 (p = 0.26026 - Not Significant) Negative (unexpected) & not significant

Interpretation: More home runs slightly decrease wins, but the effect is very small and not statistically significant. Example: Hitting 100 more home runs would decrease wins by 008, which doesn’t make sense. We should consider removing this variable.

Possible Issues: Multicollinearity: Home runs may be highly correlated with other batting stats (like hits or doubles), causing misleading coefficients. Outliers/Bad Teams: Some losing teams hit a lot of home runs but still lost, skewing results.

Solution: Check VIF (Variance Inflation Factor) for multicollinearity. Add an interaction term (e.g., TEAM_BATTING_HR * TEAM_BATTING_BB). Consider removing this variable if it remains insignificant.

4_ TEAM_PITCHING_SO (Strikeouts by Pitchers) → +0.0025 (p = 0.005) Weakly positive, borderline significant

Interpretation: More strikeouts slightly increase wins, but the effect is very small tough statistically significant (p = 0.005). Example: If a team strikes out 500 more batters in a season, they would win 0.5 more games.

Conclusion: Strikeouts help teams win, but they are not the strongest predictor of wins. The effect might be hidden by other defensive factors (e.g., walks, home runs allowed).

Solution: Consider adding walks (TEAM_PITCHING_BB) or earned run average (ERA) to capture pitching effectiveness better.

5- TEAM_FIELDING_E (Errors) → -0.023 (p < 2e-16 *) Negative & highly significant

Interpretation: For every additional error, a team is expected to lose 0.023 more games. Example: A team with 50 more errors in a season would lose ~1 more game (50 × 0.021). Conclusion: More errors directly hurt a team’s chances of winning, which makes sense in baseball.

Key Metric Interpretation:

Residual Std. Error 13.78 The average prediction error is ~13.78 wins.

Adjusted R² 0.2349 The model explains 23.5% of variance in TARGET_WINS (not very strong).

F-Statistic 175.6 (p < 2.2e-16) The overall model is statistically significant.

p < 2.2e-16 → The probability of getting this result by random chance is essentially 0 (very small).

Conclusion:

At least one of the predictor variables in the model significantly affects TARGET_WINS.

If the p-value is very small (< 0.05), we reject the null hypothesis that “none of the independent variables explain wins.”

Our model as a whole is meaningful and explains a significant amount of variation in team wins.

At least one of our predictors (TEAM_BATTING_H, TEAM_BATTING_HR, TEAM_PITCHING_SO, TEAM_FIELDING_E) is statistically significant in predicting wins.

The F-statistic of 175.6 (p < 2.2e-16) means our model is highly statistically significant. This confirms that at least one of our variables—such as home runs, hits, strikeouts, or errors—has a real impact on predicting wins.

However, we still need to check which specific variables are the most meaningful (p-values of individual coefficients) and whether we can improve the model further.

Implementing Improvement in the Model:

Step 1: Check for Multicollinearity (VIF Test)

Multicollinearity occurs when predictor variables are highly correlated, leading to unstable coefficients and inflated standard errors.

We use the Variance Inflation Factor (VIF) test. A VIF > 5 suggests multicollinearity, and VIF > 10 is a strong sign of redundancy.

Run VIF test in R:
library(car)
vif(base_model)
##   TEAM_BATTING_H  TEAM_BATTING_HR TEAM_PITCHING_SO  TEAM_FIELDING_E 
##         1.383400         2.013796         1.541450         1.824341
Interpretation

Since all VIF values are below 5, there is no significant multicollinearity in the model. This means:

Each predictor contributes unique information to explaining TARGET_WINS. Regression coefficients are stable, and we do not need to remove any variables due to multicollinearity. The model is not distorted by highly correlated predictors.

Step 2: Test Interaction Term (TEAM_BATTING_HR * TEAM_BATTING_BB)

If a team hits more home runs and draws more walks, they likely score more runs. We test if walks amplify the impact of home runs on wins.

interaction_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB + 
                          TEAM_BATTING_HR:TEAM_BATTING_BB + 
                          TEAM_PITCHING_SO + TEAM_FIELDING_E, 
                        data = stp75_train_df)

summary(interaction_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_HR:TEAM_BATTING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E, data = stp75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.107  -8.711   0.072   9.142  52.211 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     27.7763894  5.3669583   5.175 2.54e-07 ***
## TEAM_BATTING_H                   0.0479284  0.0027582  17.377  < 2e-16 ***
## TEAM_BATTING_HR                 -0.2409574  0.0282343  -8.534  < 2e-16 ***
## TEAM_BATTING_BB                 -0.0205540  0.0054979  -3.739 0.000191 ***
## TEAM_PITCHING_SO                -0.0008999  0.0014301  -0.629 0.529258    
## TEAM_FIELDING_E                 -0.0235317  0.0023355 -10.076  < 2e-16 ***
## TEAM_BATTING_HR:TEAM_BATTING_BB  0.0004341  0.0000507   8.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.33 on 1697 degrees of freedom
## Multiple R-squared:  0.2685, Adjusted R-squared:  0.2659 
## F-statistic: 103.8 on 6 and 1697 DF,  p-value: < 2.2e-16
Interpretation:

Model now includes an interaction term (TEAM_BATTING_HR:TEAM_BATTING_BB) to see if walks (BB) affect the impact of home runs (HR) on wins. Let’s break down the results.

The average prediction error is ±13.31 wins, slightly better than before (13.78). The model explains 27.2% of the variance in wins (up from 23.6% in the original model). Similar to R², meaning additional predictors added value to the model. The model as a whole is statistically significant (at least one predictor explains wins).

More Home Runs (HR) Alone → Fewer Wins (Unexpected): The negative coefficient (-0.1650) on TEAM_BATTING_HR suggests that hitting more home runs alone does not necessarily lead to more wins.

Walks (BB) Alone Have a Weak Impact on Wins: The coefficient for TEAM_BATTING_BB is negative (-0.0074) and not statistically significant (p = 0.1387). This means that walks alone do not have a strong impact on wins.

The Interaction Term (TEAM_BATTING_HR * TEAM_BATTING_BB) is Highly Significant (p = 3.39e-10) Positive Coefficient (+0.000301)

Teams that hit home runs AND get on base with walks tend to win more games. This confirms that home runs are more valuable when combined with walks.

Visualizing the Interaction Effect:

We want to see how home runs (TEAM_BATTING_HR) and walks (TEAM_BATTING_BB) impact wins (TARGET_WINS) together.

library(ggplot2)

ggplot(stp75_train_df, aes(x = TEAM_BATTING_HR, y = TARGET_WINS, color = TEAM_BATTING_BB)) +
  geom_point(alpha = 0.7) + 
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Interaction Effect of Home Runs and Walks on Wins",
       x = "Home Runs",
       y = "Wins",
       color = "Walks (BB)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Red (high walks) teams should have higher wins for the same HRs. Blue (low walks) teams may not benefit as much from HRs. The trendline is steeper for teams with more walks, confirming that walks amplify HR impact.

Adding TEAM_BATTING_2B (Doubles) to the Model:

Doubles (2B) are a strong indicator of offensive power and often correlate with scoring more runs. If a team doesn’t hit home runs, but hits many doubles, it can still score efficiently.

improved_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB + 
                        TEAM_BATTING_2B + TEAM_BATTING_HR:TEAM_BATTING_BB + 
                        TEAM_PITCHING_SO + TEAM_FIELDING_E, 
                      data = stp75_train_df)

summary(improved_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_BATTING_2B + TEAM_BATTING_HR:TEAM_BATTING_BB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E, data = stp75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.567  -8.618   0.106   8.699  55.401 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      2.242e+01  5.560e+00   4.033 5.76e-05 ***
## TEAM_BATTING_H                   5.686e-02  3.736e-03  15.221  < 2e-16 ***
## TEAM_BATTING_HR                 -2.337e-01  2.821e-02  -8.284 2.38e-16 ***
## TEAM_BATTING_BB                 -1.997e-02  5.482e-03  -3.643 0.000277 ***
## TEAM_BATTING_2B                 -3.717e-02  1.053e-02  -3.531 0.000425 ***
## TEAM_PITCHING_SO                 3.299e-04  1.467e-03   0.225 0.822121    
## TEAM_FIELDING_E                 -2.588e-02  2.421e-03 -10.691  < 2e-16 ***
## TEAM_BATTING_HR:TEAM_BATTING_BB  4.299e-04  5.054e-05   8.505  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.28 on 1696 degrees of freedom
## Multiple R-squared:  0.2738, Adjusted R-squared:  0.2708 
## F-statistic: 91.37 on 7 and 1696 DF,  p-value: < 2.2e-16
Interpretation:

The average prediction error is ±13.42 wins (slightly better than before). The model explains 27.9% of the variance in wins (slightly better than 27.2% in the previous model). Adjusted for number of predictors (still an improvement from before). The model is statistically significant overall.

Improvement After Adding TEAM_BATTING_2B (Doubles)?

Model performance improved slightly (R² increased from 27.2% → 27.9%). Doubles (TEAM_BATTING_2B) unexpectedly have a negative impact on wins.

Possible issue: Doubles may be highly correlated with other batting stats (e.g., hits, HRs). Solution: We will Check multicollinearity (VIF) or add an interaction term (e.g., TEAM_BATTING_2B * TEAM_BATTING_H). HRs alone are still negative (-0.1612), but the interaction term remains strong. Conclusion: HRs are only useful when paired with walks.

Key Findings

Adding TEAM_BATTING_2B slightly improves model performance

R² increased from 27.2% → 27.9% (small improvement). Residual Standard Error decreased from 13.31 → 13.25 (better fit). Unexpected negative coefficient for doubles (-0.0429, p = 3.09e-06)

Suggests that more doubles lead to fewer wins, which is counterintuitive. Possible reasons: Multicollinearity with TEAM_BATTING_H (hits). Bad teams might hit many doubles but still lose. Interaction term (TEAM_BATTING_HR * TEAM_BATTING_BB) remains strong and positive

Confirms that HRs are more valuable when combined with walks. Suggests plate discipline (BBs) is crucial for power hitters. Decision: Should we keep TEAM_BATTING_2B?

If VIF test shows high correlation with TEAM_BATTING_H, we should drop it. If interaction terms (e.g., TEAM_BATTING_2B * TEAM_BATTING_H) make sense, we could try that instead.

Test Four Assumptions:

Linearity
plot(improved_model, which=1)

Our diagnostic plots show a fairly linear model. ##### Normality Check:

plot(improved_model, which=2)

shapiro.test(residuals(improved_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(improved_model)
## W = 0.99714, p-value = 0.003293
# Shapiro-Wilk normality test: look for high p-value

Our QQ plot suggests normality thought there is obvious skewing on the tails, particularly on the right.

A Shapiro Wilk’s Test statistic had a value of W = 0.9971 suggesting normality.

However, the p-value (0.0033) is less than < 0.05 suggesting that residuals do not follow a normal distribution.

Heterocedasticity:
plot(improved_model, which=3)

bptest(improved_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  improved_model
## BP = 200.46, df = 7, p-value < 2.2e-16
# Breusch-Pagan test; look for high p-value

Test Statistic (BP = 200.46) suggest higher evidence of heteroscedasticity. Additionally, the p-value is extremely small (much less than 0.05), suggesting higher evidence of heteroscedasticity and that we may need to transform the data to meet the Assumption of Homoscedasticity.

Independence:
acf(residuals(improved_model))

durbinWatsonTest(improved_model)
##  lag Autocorrelation D-W Statistic p-value
##    1     0.002609554      1.991934   0.884
##  Alternative hypothesis: rho != 0
# Durbin Watson should be close to 2

Our Autocorrelation Function shows that there are lags above the blue dashed line, suggesting no autocorrelation. This is confirmed through a Durbin-Watson test statistic value of 2.03 and an autocorrelation value of -0.0017. Furthermore, as our p-value (0.558) is greater than 0.05, we do not have enough evidence to reject the null hypothesis that there is no autocorrelation. In other words, the test results suggest that our model’s residuals are independent and therefore do not violate the Independence Assumption.

Multicolinearity:
# Check Variance Inflation Factor (VIF)
vif(improved_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                  TEAM_BATTING_H                 TEAM_BATTING_HR 
##                        2.573335                       28.363619 
##                 TEAM_BATTING_BB                 TEAM_BATTING_2B 
##                        4.310450                        2.339192 
##                TEAM_PITCHING_SO                 TEAM_FIELDING_E 
##                        1.701116                        2.767363 
## TEAM_BATTING_HR:TEAM_BATTING_BB 
##                       34.483096

TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_PITCHING_SO, TEAM_FIELDING_E are in the range of (1-5) - No significant multicollinearity (good)

TEAM_BATTING_HR, TEAM_BATTING_HR:TEAM_BATTING_BB are in the range of (> 10) shows Severe multicollinearity (highly problematic).

Model 2: High-Impact Features Model (Based on Correlation & VIF)

We select variables based on correlation with TARGET_WINS and ensure they are not highly correlated with each other (VIF < 5).

library(car)

# Manually selected high-impact variables
high_impact_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR +
                        TEAM_PITCHING_HR + TEAM_PITCHING_SO + TEAM_FIELDING_E, data = stp75_train_df)

# View model summary
summary(high_impact_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_HR + TEAM_PITCHING_HR + TEAM_PITCHING_SO + TEAM_FIELDING_E, 
##     data = stp75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.719  -9.068   0.023   9.382  53.464 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.8569818  4.9380875   1.794  0.07305 .  
## TEAM_BATTING_H    0.0584401  0.0039920  14.639  < 2e-16 ***
## TEAM_BATTING_2B  -0.0379419  0.0108106  -3.510  0.00046 ***
## TEAM_BATTING_HR  -0.0179674  0.0285873  -0.629  0.52975    
## TEAM_PITCHING_HR  0.0230521  0.0282249   0.817  0.41420    
## TEAM_PITCHING_SO  0.0005344  0.0016008   0.334  0.73854    
## TEAM_FIELDING_E  -0.0243297  0.0021563 -11.283  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.62 on 1697 degrees of freedom
## Multiple R-squared:  0.2357, Adjusted R-squared:  0.233 
## F-statistic: 87.24 on 6 and 1697 DF,  p-value: < 2.2e-16
plot(high_impact_model)

# Check for multicollinearity
vif(high_impact_model)
##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_HR TEAM_PITCHING_HR 
##         2.793787         2.345585        27.684325        27.666942 
## TEAM_PITCHING_SO  TEAM_FIELDING_E 
##         1.925030         2.087761

Why these variables?

TEAM_BATTING_2B (Doubles): Important for offensive production.

TEAM_PITCHING_HR (Home Runs Allowed): Directly impacts opponent scoring (negative impact).

Removed highly correlated variables (VIF > 5).

What changed?

Better feature selection → Based on both domain knowledge and correlation analysis.

Removes redundant variables that cause multicollinearity.

Model 3: Log-Transformed Model (Handling Skewness & Outliers)

  • Some baseball statistics (e.g., Home Runs, Strikeouts) have skewed distributions. We apply log transformation to stabilize variance.
# Apply log transformation to selected variables
train_df_log <- stp75_train_df %>%
  mutate(
    log_BATTING_H = log1p(TEAM_BATTING_H),
    log_BATTING_HR = log1p(TEAM_BATTING_HR),
    log_PITCHING_SO = log1p(TEAM_PITCHING_SO)
  )


# Fit model with transformed variables
log_model <- lm(TARGET_WINS ~ log_BATTING_H + log_BATTING_HR + log_PITCHING_SO + TEAM_FIELDING_E, data = train_df_log)

# View model summary
summary(log_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ log_BATTING_H + log_BATTING_HR + log_PITCHING_SO + 
##     TEAM_FIELDING_E, data = train_df_log)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.859  -9.396   0.188   9.407  48.822 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -5.695e+02  3.753e+01 -15.174  < 2e-16 ***
## log_BATTING_H    8.754e+01  4.575e+00  19.136  < 2e-16 ***
## log_BATTING_HR  -2.767e+00  5.976e-01  -4.631 3.91e-06 ***
## log_PITCHING_SO  4.619e+00  1.346e+00   3.431 0.000617 ***
## TEAM_FIELDING_E -2.713e-02  2.103e-03 -12.899  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.59 on 1699 degrees of freedom
## Multiple R-squared:  0.238,  Adjusted R-squared:  0.2362 
## F-statistic: 132.6 on 4 and 1699 DF,  p-value: < 2.2e-16
plot(log_model)

#### Why log transformation?

Fixes right-skewed distributions (e.g., extreme HR and SO values).

Helps meet the linear regression assumption of normality.

Reduces outlier influence.

What changed?

More stable regression coefficients with reduced variability.

Improves the model fit for non-linear relationships.

Comparision between three Models:

  • Compare Adjusted R² across models:
summary(base_model)$adj.r.squared
## [1] 0.2279034
summary(high_impact_model)$adj.r.squared
## [1] 0.2330402
summary(log_model)$adj.r.squared
## [1] 0.236158
  • Check Mean Squared Error (MSE):
mse_base <- mean((stp75_train_df$TARGET_WINS - predict(base_model, stp75_train_df))^2)
mse_high_impact <- mean((stp75_train_df$TARGET_WINS - predict(high_impact_model, stp75_train_df))^2)
mse_log <- mean((train_df_log$TARGET_WINS - predict(log_model, train_df_log))^2)

print(c(mse_base, mse_high_impact, mse_log))
## [1] 186.2483 184.7914 184.2571
  • Evaluate Multicollinearity (VIF check):
vif(high_impact_model)
##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_HR TEAM_PITCHING_HR 
##         2.793787         2.345585        27.684325        27.666942 
## TEAM_PITCHING_SO  TEAM_FIELDING_E 
##         1.925030         2.087761

Model 4: Variable Selection Using Backward Selection

For this model, we are using backward selection to arrive at the smallest number of variables with statistical significance. In this case, we begin by creating a model containing all predictors, then gradually remove the variable with the highest P-value until only the variables with statistical significance remain.

Best Subset Selection

We will use Best Subset Selection to get a estimate of the optimal number of predictors in our model. This test uses Mallows’ Cp.

library(leaps)
regfit_full = regsubsets(TARGET_WINS ~ ., data = stp75_train_df, nvmax = 11)
regfit_summary = summary(regfit_full)
plot(regfit_summary$cp, xlab="Number of variables", ylab="cp")
points(which.min(regfit_summary$cp),regfit_summary$cp[which.min(regfit_summary$cp)], pch=20,col="red")

Based on the Best Subset Selection method, we estimate that our model should have 9 observations.

Cross Validation

As an alternative to Best Subset Selection, we used the Cross Validation method to estimate the optimal number of predictors in our model. Cross Validation divides our training dataset into k - 1 number of “folds”, then tests the data on the kth “fold”. For our test, we used five folds (k = 5).

set.seed(11)
folds=sample(rep(1:5,length=nrow(stp75_train_df)))

cv_errors = matrix(NA,5,10)
for(k in 1:5) {
  best_fit = regsubsets(TARGET_WINS ~ ., data=stp75_train_df[folds!=k,], nvmax=10, method="forward")
  for(i in 1:10) {
    # Extract the selected coefficients for the i-th model
    selected_coefs = coef(best_fit, id = i)
    
    # Predict manually by calculating the linear combination of the features
    # First, subset the data for the k-th fold
    test_data = stp75_train_df[folds == k, ]
    
    # Only include the predictors that were selected
    predictors = names(selected_coefs)[-1]  # Exclude the intercept term
    
    # Calculate the predictions (including the intercept)
    pred = as.matrix(test_data[, predictors]) %*% selected_coefs[predictors] + selected_coefs[1]  
    
    cv_errors[k,i]=mean((stp75_train_df$TARGET_WINS[folds==k] - pred)^2)
  }
}

rmse_cv = sqrt(apply(cv_errors,2,mean))
plot(rmse_cv, pch=5, type="b")

Based on the Cross Validation method, we estimate that our model should have 9 observations.

Backward Selection Method
summary(stp_model_full)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = stp75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.770  -8.578   0.104   8.339  57.289 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      18.4674648  6.2805997   2.940 0.003322 ** 
## TEAM_BATTING_H    0.0501921  0.0044617  11.250  < 2e-16 ***
## TEAM_BATTING_2B  -0.0232963  0.0105809  -2.202 0.027818 *  
## TEAM_BATTING_3B   0.0704114  0.0199153   3.536 0.000418 ***
## TEAM_BATTING_HR  -0.0143389  0.0352355  -0.407 0.684100    
## TEAM_BATTING_BB   0.0167559  0.0073216   2.289 0.022228 *  
## TEAM_BATTING_SO  -0.0028343  0.0038858  -0.729 0.465858    
## TEAM_BASERUN_SB   0.0286960  0.0051909   5.528 3.74e-08 ***
## TEAM_BASERUN_CS  -0.0149534  0.0180982  -0.826 0.408787    
## TEAM_PITCHING_H  -0.0005012  0.0004910  -1.021 0.307539    
## TEAM_PITCHING_HR  0.0660800  0.0323533   2.042 0.041262 *  
## TEAM_PITCHING_BB -0.0071463  0.0054045  -1.322 0.186252    
## TEAM_PITCHING_SO  0.0013541  0.0024422   0.554 0.579346    
## TEAM_FIELDING_E  -0.0213156  0.0029091  -7.327 3.63e-13 ***
## TEAM_FIELDING_DP -0.1061426  0.0148629  -7.141 1.37e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.01 on 1689 degrees of freedom
## Multiple R-squared:  0.3063, Adjusted R-squared:  0.3005 
## F-statistic: 53.26 on 14 and 1689 DF,  p-value: < 2.2e-16
AIC(stp_model_full)
## [1] 13596.37

TEAM_BATTING_HR has the highest p-value. We removed TEAM_BATTING_HR from our predictors and updated the model.

back_select_model <- update(stp_model_full, . ~ . - TEAM_BATTING_HR)
summary(back_select_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stp75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.787  -8.579   0.087   8.351  57.401 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      18.5254526  6.2774327   2.951 0.003210 ** 
## TEAM_BATTING_H    0.0502361  0.0044592  11.266  < 2e-16 ***
## TEAM_BATTING_2B  -0.0233482  0.0105775  -2.207 0.027423 *  
## TEAM_BATTING_3B   0.0712736  0.0197974   3.600 0.000327 ***
## TEAM_BATTING_BB   0.0156323  0.0067793   2.306 0.021238 *  
## TEAM_BATTING_SO  -0.0032609  0.0037408  -0.872 0.383494    
## TEAM_BASERUN_SB   0.0287248  0.0051892   5.536 3.59e-08 ***
## TEAM_BASERUN_CS  -0.0143034  0.0180232  -0.794 0.427532    
## TEAM_PITCHING_H  -0.0005316  0.0004851  -1.096 0.273307    
## TEAM_PITCHING_HR  0.0535904  0.0102340   5.236 1.84e-07 ***
## TEAM_PITCHING_BB -0.0062913  0.0049782  -1.264 0.206485    
## TEAM_PITCHING_SO  0.0015803  0.0023776   0.665 0.506359    
## TEAM_FIELDING_E  -0.0213046  0.0029083  -7.326 3.67e-13 ***
## TEAM_FIELDING_DP -0.1063550  0.0148500  -7.162 1.18e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.01 on 1690 degrees of freedom
## Multiple R-squared:  0.3062, Adjusted R-squared:  0.3008 
## F-statistic: 57.37 on 13 and 1690 DF,  p-value: < 2.2e-16
AIC(back_select_model)
## [1] 13594.53

TEAM_PITCHING_SO has the highest p-value. We removed TEAM_PITCHING_SO from our predictors and update the model.

back_select_model <- update(back_select_model, . ~ . - TEAM_PITCHING_SO)
summary(back_select_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stp75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.829  -8.630   0.114   8.304  57.451 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      18.7441482  6.2677688   2.991 0.002825 ** 
## TEAM_BATTING_H    0.0500658  0.0044511  11.248  < 2e-16 ***
## TEAM_BATTING_2B  -0.0229972  0.0105625  -2.177 0.029601 *  
## TEAM_BATTING_3B   0.0720801  0.0197569   3.648 0.000272 ***
## TEAM_BATTING_BB   0.0133025  0.0058018   2.293 0.021981 *  
## TEAM_BATTING_SO  -0.0014733  0.0025996  -0.567 0.570976    
## TEAM_BASERUN_SB   0.0283374  0.0051555   5.497 4.46e-08 ***
## TEAM_BASERUN_CS  -0.0139882  0.0180140  -0.777 0.437549    
## TEAM_PITCHING_H  -0.0005686  0.0004819  -1.180 0.238199    
## TEAM_PITCHING_HR  0.0536376  0.0102321   5.242 1.79e-07 ***
## TEAM_PITCHING_BB -0.0043455  0.0040258  -1.079 0.280550    
## TEAM_FIELDING_E  -0.0209759  0.0028655  -7.320 3.81e-13 ***
## TEAM_FIELDING_DP -0.1065292  0.0148453  -7.176 1.07e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 1691 degrees of freedom
## Multiple R-squared:  0.306,  Adjusted R-squared:  0.3011 
## F-statistic: 62.13 on 12 and 1691 DF,  p-value: < 2.2e-16
AIC(back_select_model)
## [1] 13592.98

TEAM_BATTING_SO has the highest p-value. We removed TEAM_BATTING_SO from our predictors and update the model.

back_select_model <- update(back_select_model, . ~ . - TEAM_BATTING_SO)
summary(back_select_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
##     TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, data = stp75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727  -8.628   0.086   8.333  57.129 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      16.1571222  4.2941305   3.763 0.000174 ***
## TEAM_BATTING_H    0.0513371  0.0038438  13.356  < 2e-16 ***
## TEAM_BATTING_2B  -0.0246716  0.0101389  -2.433 0.015062 *  
## TEAM_BATTING_3B   0.0744296  0.0193132   3.854 0.000121 ***
## TEAM_BATTING_BB   0.0133550  0.0057999   2.303 0.021420 *  
## TEAM_BASERUN_SB   0.0274615  0.0049174   5.585 2.73e-08 ***
## TEAM_BASERUN_CS  -0.0137429  0.0180051  -0.763 0.445404    
## TEAM_PITCHING_H  -0.0005572  0.0004814  -1.158 0.247196    
## TEAM_PITCHING_HR  0.0503000  0.0083657   6.013 2.23e-09 ***
## TEAM_PITCHING_BB -0.0039640  0.0039683  -0.999 0.317976    
## TEAM_FIELDING_E  -0.0209665  0.0028648  -7.319 3.85e-13 ***
## TEAM_FIELDING_DP -0.1058938  0.0147999  -7.155 1.24e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 1692 degrees of freedom
## Multiple R-squared:  0.3059, Adjusted R-squared:  0.3014 
## F-statistic: 67.78 on 11 and 1692 DF,  p-value: < 2.2e-16
AIC(back_select_model)
## [1] 13591.3

TEAM_BASERUN_CS has the highest p-value. We removed TEAM_BASERUN_CS from our predictors and update the model.

back_select_model <- update(back_select_model, . ~ . - TEAM_BASERUN_CS)
summary(back_select_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + 
##     TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = stp75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.649  -8.602   0.012   8.342  57.080 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      15.3222848  4.1519888   3.690 0.000231 ***
## TEAM_BATTING_H    0.0513101  0.0038431  13.351  < 2e-16 ***
## TEAM_BATTING_2B  -0.0249908  0.0101290  -2.467 0.013714 *  
## TEAM_BATTING_3B   0.0744461  0.0193108   3.855 0.000120 ***
## TEAM_BATTING_BB   0.0136940  0.0057821   2.368 0.017981 *  
## TEAM_BASERUN_SB   0.0265482  0.0047690   5.567 3.01e-08 ***
## TEAM_PITCHING_H  -0.0005841  0.0004800  -1.217 0.223869    
## TEAM_PITCHING_HR  0.0517755  0.0081382   6.362 2.56e-10 ***
## TEAM_PITCHING_BB -0.0039681  0.0039678  -1.000 0.317412    
## TEAM_FIELDING_E  -0.0204796  0.0027926  -7.334 3.46e-13 ***
## TEAM_FIELDING_DP -0.1063293  0.0147871  -7.191 9.63e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 1693 degrees of freedom
## Multiple R-squared:  0.3056, Adjusted R-squared:  0.3015 
## F-statistic: 74.52 on 10 and 1693 DF,  p-value: < 2.2e-16
AIC(back_select_model)
## [1] 13589.89

TEAM_PITCHING_BB has the highest p-value. We removed TEAM_PITCHING_BB from our predictors and update the model.

back_select_model <- update(back_select_model, . ~ . - TEAM_PITCHING_BB)
summary(back_select_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + 
##     TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stp75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.340  -8.550  -0.023   8.399  56.895 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      15.2339468  4.1510493   3.670 0.000250 ***
## TEAM_BATTING_H    0.0521105  0.0037589  13.863  < 2e-16 ***
## TEAM_BATTING_2B  -0.0253972  0.0101209  -2.509 0.012186 *  
## TEAM_BATTING_3B   0.0709375  0.0189894   3.736 0.000193 ***
## TEAM_BATTING_BB   0.0093133  0.0037744   2.468 0.013704 *  
## TEAM_BASERUN_SB   0.0258288  0.0047145   5.479 4.93e-08 ***
## TEAM_PITCHING_H  -0.0008524  0.0003980  -2.141 0.032388 *  
## TEAM_PITCHING_HR  0.0506989  0.0080667   6.285 4.16e-10 ***
## TEAM_FIELDING_E  -0.0208150  0.0027724  -7.508 9.65e-14 ***
## TEAM_FIELDING_DP -0.1065033  0.0147861  -7.203 8.83e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 1694 degrees of freedom
## Multiple R-squared:  0.3052, Adjusted R-squared:  0.3015 
## F-statistic: 82.69 on 9 and 1694 DF,  p-value: < 2.2e-16
AIC(back_select_model)
## [1] 13588.9

Backward selection using P-values arrives at a model with six variables with high statistical significance ((p-value < 0.001) and three with moderate (p-value ~= 0.1) statistical significance. Arriving at 11 predictors is consistent with our Best Subset Selection test.

vif(back_select_model)
##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_BB 
##         2.719827         2.257416         2.673757         2.133184 
##  TEAM_BASERUN_SB  TEAM_PITCHING_H TEAM_PITCHING_HR  TEAM_FIELDING_E 
##         1.587993         2.329005         2.481519         3.789448 
## TEAM_FIELDING_DP 
##         1.339361
plot(back_select_model)

4B: Alternate Model Selection

Another approach to model selection involves a combination of step-wise selection and contextual selection. In this example, we can first drop TEAM_BATTING_2B (2 base hits), TEAM_BATTING_3B (3 base hits), TEAM_BATTING_HR (home runs) before conducting our stepwise selection, as data for these columns is represented in the TEAM_BATTING_H (total hits) column, thus signaling the existence of dependence between hit count variables and total hits. It is not totally clear if TEAM_PITCHING_HR is also included in TEAM_PITCHING_H from the supporting document but we will remove it in the event that a similar relationship exists. Alternatively, we could have created a new column (TEAM_BATTING_1B) by subtracting the sum of 2 base hits, 3 base hits, and homeruns from total hits, but this alternatively solution would have a) added complexity into our model and b) added uncertainty due to possible missing values in the home runs columns.

stp75s_train_df <- stp75_train_df |> 
    drop_columns(c("TEAM_BATTING_2B", "TEAM_BATTING_3B", "TEAM_BATTING_HR", "TEAM_PITCHING_HR"))

stp25s_test_df <- stp25_test_df |> 
    drop_columns(c("TEAM_BATTING_2B", "TEAM_BATTING_3B", "TEAM_BATTING_HR", "TEAM_PITCHING_HR"))

stp_model_sm <- lm(TARGET_WINS ~ ., data = stp75s_train_df)

# Backward step-wise regression
stpb_model_sm <- step(stp_model_sm, direction = "backward")
## Start:  AIC=8789.49
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS    AIC
## - TEAM_BATTING_SO   1         1 292429 8787.5
## - TEAM_PITCHING_BB  1        33 292461 8787.7
## - TEAM_PITCHING_SO  1        91 292519 8788.0
## - TEAM_PITCHING_H   1       269 292697 8789.1
## <none>                          292428 8789.5
## - TEAM_BASERUN_CS   1       525 292954 8790.5
## - TEAM_BATTING_BB   1       909 293338 8792.8
## - TEAM_BASERUN_SB   1      6102 298530 8822.7
## - TEAM_FIELDING_DP  1      7670 300099 8831.6
## - TEAM_FIELDING_E   1      9439 301868 8841.6
## - TEAM_BATTING_H    1     66689 359117 9137.5
## 
## Step:  AIC=8787.5
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BASERUN_SB + 
##     TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS    AIC
## - TEAM_PITCHING_BB  1        57 292487 8785.8
## - TEAM_PITCHING_H   1       272 292701 8787.1
## <none>                          292429 8787.5
## - TEAM_PITCHING_SO  1       367 292797 8787.6
## - TEAM_BASERUN_CS   1       541 292970 8788.6
## - TEAM_BATTING_BB   1      1194 293623 8792.4
## - TEAM_BASERUN_SB   1      6215 298644 8821.3
## - TEAM_FIELDING_DP  1      7692 300122 8829.7
## - TEAM_FIELDING_E   1     10473 302903 8845.5
## - TEAM_BATTING_H    1     69454 361883 9148.6
## 
## Step:  AIC=8785.83
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BASERUN_SB + 
##     TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS    AIC
## - TEAM_PITCHING_SO  1       313 292799 8785.7
## <none>                          292487 8785.8
## - TEAM_BASERUN_CS   1       553 293040 8787.1
## - TEAM_PITCHING_H   1       573 293059 8787.2
## - TEAM_BATTING_BB   1      2122 294609 8796.2
## - TEAM_BASERUN_SB   1      6195 298682 8819.5
## - TEAM_FIELDING_DP  1      7687 300173 8828.0
## - TEAM_FIELDING_E   1     11084 303571 8847.2
## - TEAM_BATTING_H    1     69638 362125 9147.8
## 
## Step:  AIC=8785.65
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BASERUN_SB + 
##     TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          292799 8785.7
## - TEAM_PITCHING_H   1       410 293209 8786.0
## - TEAM_BASERUN_CS   1       694 293494 8787.7
## - TEAM_BATTING_BB   1      2042 294842 8795.5
## - TEAM_BASERUN_SB   1      6320 299119 8820.0
## - TEAM_FIELDING_DP  1      7515 300315 8826.8
## - TEAM_FIELDING_E   1     11574 304373 8849.7
## - TEAM_BATTING_H    1     77679 370478 9184.6

Performing an automated step-wise backward selection gives us 8 viable predictors. Examining the VIF shows that eight parameters are not strongly correlated with our dependent variable.

vif(stpb_model_sm)
##   TEAM_BATTING_H  TEAM_BATTING_BB  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
##         1.207011         2.074158         1.566007         1.132796 
##  TEAM_PITCHING_H  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         2.143770         3.396556         1.291391
summary(stpb_model_sm)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, data = stp75s_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.396  -8.750   0.183   8.431  52.491 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      14.1313943  4.1032804   3.444 0.000587 ***
## TEAM_BATTING_H    0.0536872  0.0025310  21.212  < 2e-16 ***
## TEAM_BATTING_BB   0.0129380  0.0037619   3.439 0.000597 ***
## TEAM_BASERUN_SB   0.0286310  0.0047321   6.050 1.77e-09 ***
## TEAM_BASERUN_CS  -0.0351827  0.0175441  -2.005 0.045081 *  
## TEAM_PITCHING_H  -0.0005947  0.0003860  -1.541 0.123606    
## TEAM_FIELDING_E  -0.0217219  0.0026530  -8.188 5.17e-16 ***
## TEAM_FIELDING_DP -0.0968235  0.0146751  -6.598 5.56e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.14 on 1696 degrees of freedom
## Multiple R-squared:  0.2893, Adjusted R-squared:  0.2864 
## F-statistic: 98.65 on 7 and 1696 DF,  p-value: < 2.2e-16
AIC(stpb_model_sm)
## [1] 13623.39
plot(stpb_model_sm)

Looking at the p-values and t-values for the remaining 7 predictors show that there is one variables without statistical significance, TEAM_PITCHING_H. As this predictor has a high p-value (0.12) and a low t-value (-1.54), we should consider removing it.

stpb_model_xs <- update(stpb_model_sm, . ~ . - TEAM_PITCHING_H)
summary(stpb_model_xs)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = stp75s_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.070  -8.834   0.110   8.350  53.599 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      15.111949   4.055255   3.727 0.000201 ***
## TEAM_BATTING_H    0.052652   0.002441  21.569  < 2e-16 ***
## TEAM_BATTING_BB   0.012851   0.003763   3.415 0.000653 ***
## TEAM_BASERUN_SB   0.030592   0.004559   6.710 2.65e-11 ***
## TEAM_BASERUN_CS  -0.035804   0.017547  -2.040 0.041456 *  
## TEAM_FIELDING_E  -0.024057   0.002178 -11.045  < 2e-16 ***
## TEAM_FIELDING_DP -0.097476   0.014675  -6.642 4.15e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.14 on 1697 degrees of freedom
## Multiple R-squared:  0.2884, Adjusted R-squared:  0.2858 
## F-statistic: 114.6 on 6 and 1697 DF,  p-value: < 2.2e-16
AIC(stpb_model_xs)
## [1] 13623.78
plot(stpb_model_xs)

Removing the TEAM_PITCHING_H slightly increases our AIC from 13623.39 to 13623.78 and slightly decreases the Adjusted R-squared from 0.2864 to 0.2858, but overall the models are somewhat similar the auto-generated step-wise selected model.

Variant Comparison

The function compare_performance allows comparison of various metrics across diffrent models including R2 and R2 (adj.). It also weights AIC and BIC values to help rank the performance of our models. Based on the result of this test, it appears that our original backward-selected model performed best.

compare_performance(back_select_model, stpb_model_sm, stpb_model_xs, rank = TRUE)

Model Diagnostics

In this section, we will test if our original backward-selected model meets the four assumptions for linearity.

Linearity
plot(back_select_model, which=1)

Our diagnostic plots show a fairly linear model.

Normality
plot(back_select_model, which=2)

shapiro.test(residuals(back_select_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(back_select_model)
## W = 0.9967, p-value = 0.001034
# Shapiro-Wilk normality test: look for high p-value

Our QQ plot suggests normality thought there is obvious skewing on the tails, particularly on the right.

A Shapiro Wilk’s Test statistic had a value of 0.9967 is close to 1 suggesting normality. However, the p-value (0.0010) is less than < 0.05 suggests that residuals may not follow a normal distribution.

Heteroscedasticity:
plot(back_select_model, which=3)

plot(back_select_model, which=4)

bptest(back_select_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  back_select_model
## BP = 210.37, df = 9, p-value < 2.2e-16
# Breusch-Pagan test; look for high p-value

Our Scale-Location plot shows that points appear somewhat evenly distributed above and below the trend line. While there is no obvious fan/wedge pattern, there is clustering in the center suggesting underfitting, high leverage outliers or that additional transformation may be needed. As our Cook’s Distance plot has no values with a greater than 1, we can rule our the effects of high leverage points.

The Breusch-Page test statistic BP (210.37) and the small p-value (2.2e-16) suggest evidence of heteroscedasticity. Though the values are different that we may need to transform the data to meet the Assumption of Homoscedasticity.

Independence:
acf(residuals(back_select_model))

durbinWatsonTest(back_select_model)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.01715261      2.028732    0.59
##  Alternative hypothesis: rho != 0
# Durbin Watson should be close to 2

Our Autocorrelation Function shows that there are lags above the blue dashed line, suggesting no autocorrelation. This is confirmed through a Durbin-Watson test statistic value of 2.03 and an autocorrelation value of -0.0017. Furthermore, as our p-value (0.558) is greater than 0.05, we do not have enough evidence to reject the null hypothesis that there is no autocorrelation. In other words, the test results suggest that our model’s residuals are independent and therefore do not violate the Independence Assumption.

Conclusion The backward selected model violates the Homoscedasticity Assumption and has mixed results for our Normality Assumption. We should therefore test applying transformations to our model.

4C: Model with Transformed Data

So far, our diagnostic plots show that our model is somewhat linear, independent, heteroschedastic, but our residuals vs leverage indicates the presence of outliers. We will attempt to use transformations to improve our model.

4Ci: Box-Cox Transformation on Dependent Variable

In this section, we apply a Box-Cox Transformation

stpbc_model <- boxcox(back_select_model, lambda = seq(-3,3))

plot(stpbc_model)

best_lambda <- stpbc_model$x[which(stpbc_model$y==max(stpbc_model$y))]
 
stp_model_inv <- lm((TARGET_WINS)^best_lambda ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stp75_train_df)

summary(stp_model_inv)
## 
## Call:
## lm(formula = (TARGET_WINS)^best_lambda ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_HR + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = stp75_train_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -181.370  -31.826   -1.085   29.386  220.329 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      24.328584  21.671396   1.123   0.2618    
## TEAM_BATTING_H    0.172877   0.015518  11.140  < 2e-16 ***
## TEAM_BATTING_2B  -0.077426   0.037841  -2.046   0.0409 *  
## TEAM_BATTING_3B   0.283539   0.069167   4.099 4.34e-05 ***
## TEAM_BATTING_SO   0.002826   0.012382   0.228   0.8195    
## TEAM_BASERUN_SB   0.106732   0.017584   6.070 1.58e-09 ***
## TEAM_PITCHING_HR  0.200389   0.036068   5.556 3.20e-08 ***
## TEAM_PITCHING_BB  0.005050   0.009782   0.516   0.6057    
## TEAM_PITCHING_SO -0.007150   0.007208  -0.992   0.3213    
## TEAM_FIELDING_E  -0.087885   0.007809 -11.254  < 2e-16 ***
## TEAM_FIELDING_DP -0.356033   0.052671  -6.760 1.90e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.64 on 1693 degrees of freedom
## Multiple R-squared:  0.2935, Adjusted R-squared:  0.2894 
## F-statistic: 70.34 on 10 and 1693 DF,  p-value: < 2.2e-16

Looking at the summary of our new Box-Cox transformed model shows three variables are not statistically significant. We should therefore remove them one at a time.

stp_model_inv <- update(stp_model_inv, . ~ . - TEAM_BATTING_SO)
summary(stp_model_inv)
## 
## Call:
## lm(formula = (TARGET_WINS)^best_lambda ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stp75_train_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -181.221  -31.839   -1.126   29.483  220.649 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      27.262063  17.442915   1.563   0.1183    
## TEAM_BATTING_H    0.171668   0.014583  11.772  < 2e-16 ***
## TEAM_BATTING_2B  -0.075855   0.037198  -2.039   0.0416 *  
## TEAM_BATTING_3B   0.280440   0.067802   4.136 3.70e-05 ***
## TEAM_BASERUN_SB   0.108141   0.016459   6.570 6.67e-11 ***
## TEAM_PITCHING_HR  0.204117   0.032149   6.349 2.78e-10 ***
## TEAM_PITCHING_BB  0.004005   0.008641   0.464   0.6431    
## TEAM_PITCHING_SO -0.006066   0.005419  -1.119   0.2632    
## TEAM_FIELDING_E  -0.088581   0.007188 -12.324  < 2e-16 ***
## TEAM_FIELDING_DP -0.355896   0.052652  -6.759 1.90e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.63 on 1694 degrees of freedom
## Multiple R-squared:  0.2935, Adjusted R-squared:  0.2898 
## F-statistic:  78.2 on 9 and 1694 DF,  p-value: < 2.2e-16
AIC(back_select_model)
## [1] 13588.9
stp_model_inv <- update(stp_model_inv, . ~ . - TEAM_PITCHING_BB)
summary(stp_model_inv)
## 
## Call:
## lm(formula = (TARGET_WINS)^best_lambda ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stp75_train_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -177.826  -31.742   -1.087   29.758  220.026 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      27.675276  17.416083   1.589   0.1122    
## TEAM_BATTING_H    0.171840   0.014574  11.791  < 2e-16 ***
## TEAM_BATTING_2B  -0.076072   0.037187  -2.046   0.0409 *  
## TEAM_BATTING_3B   0.284474   0.067225   4.232 2.44e-05 ***
## TEAM_BASERUN_SB   0.109793   0.016065   6.834 1.15e-11 ***
## TEAM_PITCHING_HR  0.207167   0.031461   6.585 6.05e-11 ***
## TEAM_PITCHING_SO -0.005744   0.005373  -1.069   0.2852    
## TEAM_FIELDING_E  -0.088788   0.007172 -12.380  < 2e-16 ***
## TEAM_FIELDING_DP -0.351605   0.051820  -6.785 1.60e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.62 on 1695 degrees of freedom
## Multiple R-squared:  0.2934, Adjusted R-squared:  0.2901 
## F-statistic: 87.98 on 8 and 1695 DF,  p-value: < 2.2e-16
AIC(back_select_model)
## [1] 13588.9
stp_model_inv <- update(stp_model_inv, . ~ . - TEAM_PITCHING_SO)
summary(stp_model_inv)
## 
## Call:
## lm(formula = (TARGET_WINS)^best_lambda ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, data = stp75_train_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -178.015  -31.558   -1.138   29.634  218.413 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      16.581549  13.987093   1.185   0.2360    
## TEAM_BATTING_H    0.178723   0.013076  13.668  < 2e-16 ***
## TEAM_BATTING_2B  -0.084945   0.036250  -2.343   0.0192 *  
## TEAM_BATTING_3B   0.289749   0.067047   4.322 1.64e-05 ***
## TEAM_BASERUN_SB   0.107717   0.015948   6.754 1.97e-11 ***
## TEAM_PITCHING_HR  0.191652   0.027914   6.866 9.25e-12 ***
## TEAM_FIELDING_E  -0.092108   0.006465 -14.246  < 2e-16 ***
## TEAM_FIELDING_DP -0.345677   0.051525  -6.709 2.66e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.62 on 1696 degrees of freedom
## Multiple R-squared:  0.2929, Adjusted R-squared:   0.29 
## F-statistic: 100.4 on 7 and 1696 DF,  p-value: < 2.2e-16
AIC(back_select_model)
## [1] 13588.9
Testing Our Assumptions
Linearity
plot(stp_model_inv, which=1)

Our Residuals vs. Fitted plot suggest a fairly linear model.

Normality
plot(stp_model_inv, which = 2)

shapiro.test(residuals(stp_model_inv))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(stp_model_inv)
## W = 0.99561, p-value = 7.132e-05
# Shapiro-Wilk normality test: look for high p-value

Our QQ plot suggests normality thought there is obvious skewing on the tails, particularly on the right.

A Shapiro Wilk’s Test statistic had a value of 0.996, also suggesting normality. However, since the p-value (7.132e-05) is less than < 0.05 we may still be violating our normality assumption.

Heteroscedasticity
plot(stp_model_inv, which = 3)

plot(stp_model_inv, which = 4)

bptest(stp_model_inv)
## 
##  studentized Breusch-Pagan test
## 
## data:  stp_model_inv
## BP = 176.76, df = 7, p-value < 2.2e-16
# Breusch-Pagan test; look for high p-value

Our Scale-Location plot shows that points appear somewhat evenly distributed above and below the trend line. While there is no obvious fan/wedge pattern, there is clustering in the center suggesting underfitting, high leverage outliers or that additional transformation may be needed. As our Cook’s Distance plot has no values with a greater than 1, we can rule our the effects of high leverage points.

The Breusch-Page test statistic BP (176.76) and the small p-value (2.2e-16) suggest evidence of heteroscedasticity. Though the values are different that we may need to transform the data to meet the Assumption of Homoscedasticity.

Independence
acf(residuals(stp_model_inv))

durbinWatsonTest(stp_model_inv)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.02017478      2.035401    0.42
##  Alternative hypothesis: rho != 0
# Durbin Watson should be close to 2

Our Autocorrelation Function shows that there are lags above the blue dashed line, suggesting no autocorrelation. This is confirmed through a Durbin-Watson test statistic value of 2.03 and an autocorrelation value of -0.021. Furthermore, as our p-value (0.416) is greater than 0.05, we do not have enough evidence to reject the null hypothesis that there is no autocorrelation. In other words, the test results suggest that our model’s residuals are independent and therefore do not violate the Independence Assumption.

Conclusion Applying a Box-Cox transformation to our model appears does not seem to have improved the results of our backward selected model. This model continues to violate the Homoscedasticity Assumption and has mixed results for our Normality Assumption.

4Cii: Log Transformation on the Dependent Variable (Y)

stp_logy_model <- lm(log(TARGET_WINS + 1) ~., data = stp75_train_df)

# Backward step-wise regression
stpb_logy_model <- step(stp_logy_model, direction = "backward")
## Start:  AIC=-5941.6
## log(TARGET_WINS + 1) ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS     AIC
## - TEAM_PITCHING_SO  1    0.0024 51.230 -5943.5
## - TEAM_PITCHING_BB  1    0.0047 51.232 -5943.4
## - TEAM_BATTING_SO   1    0.0084 51.236 -5943.3
## - TEAM_BASERUN_CS   1    0.0139 51.241 -5943.1
## - TEAM_BATTING_BB   1    0.0471 51.275 -5942.0
## - TEAM_BATTING_HR   1    0.0490 51.276 -5942.0
## <none>                          51.227 -5941.6
## - TEAM_PITCHING_H   1    0.1907 51.418 -5937.3
## - TEAM_BATTING_2B   1    0.1920 51.419 -5937.2
## - TEAM_PITCHING_HR  1    0.1931 51.421 -5937.2
## - TEAM_BATTING_3B   1    0.3472 51.575 -5932.1
## - TEAM_BASERUN_SB   1    0.9718 52.199 -5911.6
## - TEAM_FIELDING_DP  1    1.4521 52.680 -5896.0
## - TEAM_FIELDING_E   1    3.1596 54.387 -5841.6
## - TEAM_BATTING_H    1    4.4396 55.667 -5802.0
## 
## Step:  AIC=-5943.51
## log(TARGET_WINS + 1) ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS     AIC
## - TEAM_BATTING_SO   1    0.0067 51.237 -5945.3
## - TEAM_PITCHING_BB  1    0.0100 51.240 -5945.2
## - TEAM_BASERUN_CS   1    0.0140 51.244 -5945.0
## - TEAM_BATTING_HR   1    0.0466 51.276 -5944.0
## <none>                          51.230 -5943.5
## - TEAM_BATTING_BB   1    0.0650 51.295 -5943.4
## - TEAM_PITCHING_H   1    0.1886 51.419 -5939.3
## - TEAM_PITCHING_HR  1    0.1929 51.423 -5939.1
## - TEAM_BATTING_2B   1    0.1947 51.425 -5939.0
## - TEAM_BATTING_3B   1    0.3456 51.576 -5934.1
## - TEAM_BASERUN_SB   1    0.9954 52.225 -5912.7
## - TEAM_FIELDING_DP  1    1.4512 52.681 -5897.9
## - TEAM_FIELDING_E   1    3.2759 54.506 -5839.9
## - TEAM_BATTING_H    1    4.4693 55.699 -5803.0
## 
## Step:  AIC=-5945.29
## log(TARGET_WINS + 1) ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
##     TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS     AIC
## - TEAM_PITCHING_BB  1    0.0109 51.247 -5946.9
## - TEAM_BASERUN_CS   1    0.0141 51.251 -5946.8
## - TEAM_BATTING_HR   1    0.0423 51.279 -5945.9
## <none>                          51.237 -5945.3
## - TEAM_BATTING_BB   1    0.0617 51.298 -5945.2
## - TEAM_BATTING_2B   1    0.1898 51.426 -5941.0
## - TEAM_PITCHING_H   1    0.1943 51.431 -5940.8
## - TEAM_PITCHING_HR  1    0.1959 51.432 -5940.8
## - TEAM_BATTING_3B   1    0.3397 51.576 -5936.0
## - TEAM_BASERUN_SB   1    1.1429 52.379 -5909.7
## - TEAM_FIELDING_DP  1    1.4766 52.713 -5898.9
## - TEAM_FIELDING_E   1    3.2753 54.512 -5841.7
## - TEAM_BATTING_H    1    5.7039 56.941 -5767.4
## 
## Step:  AIC=-5946.93
## log(TARGET_WINS + 1) ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
##     TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS     AIC
## - TEAM_BASERUN_CS   1    0.0129 51.260 -5948.5
## - TEAM_BATTING_HR   1    0.0317 51.279 -5947.9
## <none>                          51.247 -5946.9
## - TEAM_BATTING_BB   1    0.0824 51.330 -5946.2
## - TEAM_BATTING_2B   1    0.1959 51.443 -5942.4
## - TEAM_PITCHING_HR  1    0.2204 51.468 -5941.6
## - TEAM_BATTING_3B   1    0.3337 51.581 -5937.9
## - TEAM_PITCHING_H   1    0.3527 51.600 -5937.2
## - TEAM_BASERUN_SB   1    1.1358 52.383 -5911.6
## - TEAM_FIELDING_DP  1    1.4838 52.731 -5900.3
## - TEAM_FIELDING_E   1    3.3164 54.564 -5842.1
## - TEAM_BATTING_H    1    6.1379 57.385 -5756.2
## 
## Step:  AIC=-5948.5
## log(TARGET_WINS + 1) ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + 
##     TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS     AIC
## - TEAM_BATTING_HR   1    0.0288 51.289 -5949.5
## <none>                          51.260 -5948.5
## - TEAM_BATTING_BB   1    0.0907 51.351 -5947.5
## - TEAM_BATTING_2B   1    0.2005 51.461 -5943.9
## - TEAM_PITCHING_HR  1    0.2205 51.481 -5943.2
## - TEAM_BATTING_3B   1    0.3362 51.597 -5939.4
## - TEAM_PITCHING_H   1    0.3639 51.624 -5938.4
## - TEAM_BASERUN_SB   1    1.1452 52.406 -5912.9
## - TEAM_FIELDING_DP  1    1.4971 52.757 -5901.4
## - TEAM_FIELDING_E   1    3.4025 54.663 -5841.0
## - TEAM_BATTING_H    1    6.1321 57.392 -5758.0
## 
## Step:  AIC=-5949.55
## log(TARGET_WINS + 1) ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_HR + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS     AIC
## <none>                          51.289 -5949.5
## - TEAM_BATTING_BB   1    0.0828 51.372 -5948.8
## - TEAM_BATTING_2B   1    0.2022 51.491 -5944.8
## - TEAM_PITCHING_H   1    0.3366 51.626 -5940.4
## - TEAM_BATTING_3B   1    0.4030 51.692 -5938.2
## - TEAM_PITCHING_HR  1    0.9628 52.252 -5919.9
## - TEAM_BASERUN_SB   1    1.1686 52.458 -5913.2
## - TEAM_FIELDING_DP  1    1.5021 52.791 -5902.4
## - TEAM_FIELDING_E   1    3.3737 54.663 -5843.0
## - TEAM_BATTING_H    1    6.1100 57.399 -5759.8
Testing Our Assumptions
Linearity
plot(stp_logy_model, which=1)

Our Residuals vs. Fitted plot suggests a somewhat linear model but there is visible bowing in the trend line.

Normality
plot(stp_logy_model, which = 2)

shapiro.test(residuals(stp_logy_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(stp_logy_model)
## W = 0.98025, p-value = 1.416e-14
# Shapiro-Wilk normality test: look for high p-value

Our QQ plot suggests normality thought there is obvious skewing on the tails, particularly on the left.

A Shapiro Wilk’s Test statistic had a value of 0.980, also suggesting normality. However, since the p-value (1.416e-14) is less than < 0.05 we may still be violating our normality assumption.

Heteroscedasticity
plot(stp_logy_model, which = 3)

plot(stp_logy_model, which = 4)

bptest(stp_logy_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  stp_logy_model
## BP = 383.07, df = 14, p-value < 2.2e-16
# Breusch-Pagan test; look for high p-value

Our Scale-Location plot shows clear bowing in the trend line, suggesting that the variance of the residuals is not constant and therefore heteroschedastic. While there is no obvious fan/wedge pattern, there is clustering in the center suggesting underfitting, high leverage outliers or that additional transformation may be needed. As our Cook’s Distance plot has no values with a greater than 1, we can rule our the effects of high leverage points. However, it should be pointed the leverage points appear to be more influential than in our backwards selected model or our Box-Cox transformed model.

The Breusch-Page test statistic BP (383.07) indicates a substantial relationship between the residual variance and the predictors and the small p-value (2.2e-16) suggest evidence of heteroscedasticity. Though the values are different that we may need to transform the data to meet the Assumption of Homoscedasticity.

Independence
acf(residuals(stp_logy_model))

durbinWatsonTest(stp_logy_model)
##  lag Autocorrelation D-W Statistic p-value
##    1    -0.004100597      2.002974   0.946
##  Alternative hypothesis: rho != 0
# Durbin Watson should be close to 2

Our Autocorrelation Function shows that there are lags above the blue dashed line, suggesting no autocorrelation. This is confirmed through a Durbin-Watson test statistic value of 2.00 and an autocorrelation value of -0.0041. Furthermore, as our p-value (0.96) is greater than 0.05, we do not have enough evidence to reject the null hypothesis that there is no autocorrelation. In other words, the test results suggest that our model’s residuals are independent and therefore do not violate the Independence Assumption.

Conclusion Applying a log transformation the dependent variable in our model appears to inferior to the backward selected model and the Box-Cox transformed model. We should therefore avoid using this model.

4Ciii: Log Transformation on Independent Variables (X)

Some of our predictors (TEAM_PITCHING_SO, TEAM_PITCHING_BB, & TEAM_PITCHING_H) showed the presence of outliers. We will transform these variables with a log transformation.

stplg75_train_df <- stp75_train_df %>%
  mutate(
    TEAM_PITCHING_SO = log(TEAM_PITCHING_SO),
    TEAM_PITCHING_BB = log(TEAM_PITCHING_BB),
    TEAM_PITCHING_H = log(TEAM_PITCHING_H),
  )

stplg25_test_df <- stp25_test_df %>%
  mutate(
    TEAM_PITCHING_SO = log1p(TEAM_PITCHING_SO),
    TEAM_PITCHING_BB = log1p(TEAM_PITCHING_BB),
    TEAM_PITCHING_H = log1p(TEAM_PITCHING_H),
  )

stplg75_train_df %>%
  gather(variable, value, -TARGET_WINS) %>%
  ggplot(., aes(value, TARGET_WINS)) + 
  geom_point(fill = "#628B3A", color="#628B3A")  + 
  geom_smooth(method = "lm", se = FALSE, color = "black") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = "Wins")
## `geom_smooth()` using formula = 'y ~ x'

Our scatter plots show some improvement in our outliers.

Backwards Selection

We will use backward step-wise selection to build our model.

stp_logx_model_full <- lm(TARGET_WINS ~ ., data = stplg75_train_df)

# Backward step-wise regression
stp_logx_model <- step(stp_logx_model_full, direction = "backward")
## Start:  AIC=8695.75
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS    AIC
## - TEAM_PITCHING_HR  1       7.2 275487 8693.8
## - TEAM_BASERUN_CS   1     148.0 275628 8694.7
## - TEAM_BATTING_HR   1     238.1 275718 8695.2
## <none>                          275480 8695.8
## - TEAM_BATTING_2B   1     675.3 276155 8697.9
## - TEAM_PITCHING_H   1    2525.5 278005 8709.3
## - TEAM_BATTING_SO   1    3066.1 278546 8712.6
## - TEAM_PITCHING_SO  1    3646.6 279127 8716.2
## - TEAM_BATTING_3B   1    4252.8 279733 8719.9
## - TEAM_FIELDING_DP  1    6895.7 282376 8735.9
## - TEAM_BATTING_H    1    6934.6 282415 8736.1
## - TEAM_BASERUN_SB   1    9010.2 284490 8748.6
## - TEAM_PITCHING_BB  1    9160.2 284640 8749.5
## - TEAM_BATTING_BB   1   10137.4 285617 8755.3
## - TEAM_FIELDING_E   1   15045.3 290525 8784.4
## 
## Step:  AIC=8693.8
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS    AIC
## - TEAM_BASERUN_CS   1     147.3 275634 8692.7
## <none>                          275487 8693.8
## - TEAM_BATTING_2B   1     678.9 276166 8696.0
## - TEAM_PITCHING_H   1    2798.8 278286 8709.0
## - TEAM_BATTING_SO   1    3154.4 278642 8711.2
## - TEAM_BATTING_HR   1    3200.3 278687 8711.5
## - TEAM_PITCHING_SO  1    3847.4 279335 8715.4
## - TEAM_BATTING_3B   1    4347.2 279834 8718.5
## - TEAM_FIELDING_DP  1    6894.3 282381 8733.9
## - TEAM_BATTING_H    1    6928.5 282416 8734.1
## - TEAM_BASERUN_SB   1    9068.5 284556 8747.0
## - TEAM_PITCHING_BB  1    9277.1 284764 8748.2
## - TEAM_BATTING_BB   1   10187.3 285674 8753.7
## - TEAM_FIELDING_E   1   16384.1 291871 8790.2
## 
## Step:  AIC=8692.71
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          275634 8692.7
## - TEAM_BATTING_2B   1     713.4 276348 8695.1
## - TEAM_PITCHING_H   1    2705.7 278340 8707.4
## - TEAM_BATTING_SO   1    3248.7 278883 8710.7
## - TEAM_BATTING_HR   1    3608.4 279243 8712.9
## - TEAM_PITCHING_SO  1    3986.6 279621 8715.2
## - TEAM_BATTING_3B   1    4358.6 279993 8717.4
## - TEAM_FIELDING_DP  1    6979.1 282614 8733.3
## - TEAM_BATTING_H    1    7058.8 282693 8733.8
## - TEAM_BASERUN_SB   1    9037.2 284672 8745.7
## - TEAM_PITCHING_BB  1    9222.8 284857 8746.8
## - TEAM_BATTING_BB   1   10245.1 285880 8752.9
## - TEAM_FIELDING_E   1   16416.1 292051 8789.3

Looking at the summary for this model shows that all predictors have high statistic significance.

summary(stp_logx_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stplg75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.956  -8.307  -0.153   7.949  61.107 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -41.094143  18.185201  -2.260   0.0240 *  
## TEAM_BATTING_H     0.038425   0.005839   6.581 6.23e-11 ***
## TEAM_BATTING_2B   -0.021815   0.010427  -2.092   0.0366 *  
## TEAM_BATTING_3B    0.103040   0.019926   5.171 2.60e-07 ***
## TEAM_BATTING_HR    0.050302   0.010691   4.705 2.74e-06 ***
## TEAM_BATTING_BB    0.071239   0.008986   7.928 4.01e-15 ***
## TEAM_BATTING_SO   -0.023366   0.005234  -4.464 8.56e-06 ***
## TEAM_BASERUN_SB    0.037557   0.005044   7.446 1.53e-13 ***
## TEAM_PITCHING_H   17.554470   4.308701   4.074 4.83e-05 ***
## TEAM_PITCHING_BB -28.690911   3.814242  -7.522 8.71e-14 ***
## TEAM_PITCHING_SO  16.554274   3.347389   4.945 8.35e-07 ***
## TEAM_FIELDING_E   -0.037062   0.003693 -10.036  < 2e-16 ***
## TEAM_FIELDING_DP  -0.095601   0.014610  -6.543 7.95e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.77 on 1691 degrees of freedom
## Multiple R-squared:  0.331,  Adjusted R-squared:  0.3263 
## F-statistic: 69.72 on 12 and 1691 DF,  p-value: < 2.2e-16

The Residuals vs. Fitted and QQ Plots show a fairly linear pattern, while Scale-Location plot suggest Homoscedasticity. The Residuals vs Leverage plot reveals that our outliers have less leverage than in the pre- log transformed model.

plot(stp_logx_model)

stplg75_train_df <- stplg75_train_df |>
  mutate(
    n = row_number()
  )
Multicolinearity

A VIF Test shows several variables that are highly correlated, including TEAM_PITCHING_H, TEAM_BATTING_BB, TEAM_PITCHING_SO, TEAM_PITCHING_BB, TEAM_FIELDING_E & TEAM_BATTING_H. We may choose to leave these out.

vif(stp_logx_model)
##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         6.803881         2.484158         3.052159         4.407697 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_H 
##        12.534258        15.179164         1.884433        15.254714 
## TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         8.746531        11.608842         6.971147         1.355683
Testing Our Assumptions
Linearity
plot(stp_logx_model, which=1)

Our Residuals vs. Fitted plot suggest a fairly linear model.

Normality
plot(stp_logx_model, which = 2)

# Shapiro-Wilk normality test: look for high p-value
shapiro.test(residuals(stp_logx_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(stp_logx_model)
## W = 0.99433, p-value = 4.301e-06

Our QQ plot suggests normality thought there is some skewing on the tails, particularly on the right. The skewing appears to be less pitched than in our backward-selected model or Box-Cox Transformed model.

A Shapiro Wilk’s Test statistic had a value of 0.9943, also suggesting normality. However, since the p-value (4.273e-06) is less than < 0.05 we may still be violating our normality assumption.

Heteroscedasticity
plot(stp_logx_model, which = 3)

plot(stp_logx_model, which = 4)

bptest(stp_logx_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  stp_logx_model
## BP = 302.08, df = 12, p-value < 2.2e-16
# Breusch-Pagan test; look for high p-value

Our Scale-Location plot shows that points appear somewhat evenly distributed above and below the trend line but there is clear bowing in the trend line, suggesting that the variance of the residuals is not constant and therefore heteroschedastic. While there is no obvious fan/wedge pattern, there is clustering in the center suggesting underfitting, high leverage outliers or that additional transformation may be needed. As our Cook’s Distance plot has no values with a greater than 1, we can rule our the effects of high leverage points.

The Breusch-Page test statistic BP (302.21) and the small p-value (2.2e-16) suggest evidence of heteroscedasticity. Though the values are different that we may need to transform the data to meet the Assumption of Homoscedasticity.

Independence
acf(residuals(stp_logx_model))

durbinWatsonTest(stp_logx_model)
##  lag Autocorrelation D-W Statistic p-value
##    1      -0.0084413      2.011121   0.822
##  Alternative hypothesis: rho != 0
# Durbin Watson should be close to 2

Our Autocorrelation Function shows that there are lags above the blue dashed line, suggesting no autocorrelation. This is confirmed through a Durbin-Watson test statistic value of 2.011 and an autocorrelation value of -0.008. Furthermore, as our p-value (0.822) is greater than 0.05, we do not have enough evidence to reject the null hypothesis that there is no autocorrelation. In other words, the test results suggest that our model’s residuals are independent and therefore do not violate the Independence Assumption.

Model 5: Lasso Regression

The Least Absolute Shrinkage and Selection Operator (LASSO) selection method uses a shrinkage approach to determining the optimal predictors by attempting to find a balance between simplicity and accuracy.It applies a penalty to the standard linear regression model to encourage the coefficients of features with weak influence to equal zero to prevent overfitting.

x = model.matrix(TARGET_WINS ~ ., data = stplg75_train_df)
y = stplg75_train_df$TARGET_WINS

fit_lasso = glmnet(x, y, alpha=1)
plot(fit_lasso, xvar="lambda", label=TRUE)

cv_lasso = cv.glmnet(x,y,alpha=1)
#plot(cv.lasso)
coef(cv_lasso)
## 17 x 1 sparse Matrix of class "dgCMatrix"
##                            s1
## (Intercept)      20.682694712
## (Intercept)       .          
## TEAM_BATTING_H    0.042771197
## TEAM_BATTING_2B   .          
## TEAM_BATTING_3B   0.025371136
## TEAM_BATTING_HR   .          
## TEAM_BATTING_BB   0.009857031
## TEAM_BATTING_SO   .          
## TEAM_BASERUN_SB   0.020057853
## TEAM_BASERUN_CS   .          
## TEAM_PITCHING_H   .          
## TEAM_PITCHING_HR  0.016665501
## TEAM_PITCHING_BB  .          
## TEAM_PITCHING_SO  .          
## TEAM_FIELDING_E  -0.017855099
## TEAM_FIELDING_DP -0.062078149
## n                 .

Performing LASSO on our training subset – 75% of full training set with log transformation applied – gives us 7 predictors: - TEAM_BATTING_H - TEAM_BATTING_3B - TEAM_BATTING_BB - TEAM_BASERUN_SB - TEAM_PITCHING_HR - TEAM_FIELDING_E - TEAM_FIELDING_DP

lasso_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stplg75_train_df)

summary(lasso_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, data = stplg75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.218  -8.637   0.087   8.262  56.386 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      18.793612   4.011279   4.685 3.02e-06 ***
## TEAM_BATTING_H    0.044628   0.002843  15.700  < 2e-16 ***
## TEAM_BATTING_3B   0.082614   0.018680   4.423 1.04e-05 ***
## TEAM_BATTING_BB   0.009393   0.003782   2.484   0.0131 *  
## TEAM_BASERUN_SB   0.027885   0.004659   5.985 2.63e-09 ***
## TEAM_PITCHING_HR  0.044835   0.007878   5.691 1.48e-08 ***
## TEAM_FIELDING_E  -0.023203   0.002170 -10.691  < 2e-16 ***
## TEAM_FIELDING_DP -0.106021   0.014819  -7.155 1.24e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.03 on 1696 degrees of freedom
## Multiple R-squared:  0.301,  Adjusted R-squared:  0.2981 
## F-statistic: 104.3 on 7 and 1696 DF,  p-value: < 2.2e-16
vif(lasso_model)
##   TEAM_BATTING_H  TEAM_BATTING_3B  TEAM_BATTING_BB  TEAM_BASERUN_SB 
##         1.547697         2.574600         2.130845         1.543286 
## TEAM_PITCHING_HR  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         2.354824         2.310808         1.338637

Our VIF test shows no strong correlation between predictors.

Testing Our Assumptions
Linearity
plot(lasso_model, which=1)

Our Residuals vs. Fitted plot suggest a fairly linear model.

Normality
plot(lasso_model, which = 2)

# Shapiro-Wilk normality test: look for high p-value
shapiro.test(residuals(lasso_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lasso_model)
## W = 0.99644, p-value = 0.0005262

Our QQ plot suggests normality thought there is some skewing on the tails, particularly on the right. The skewing appears to be less pitched than in our backward-selected model or Box-Cox Transformed model.

A Shapiro Wilk’s Test statistic had a value of 0.9964, also suggesting normality. However, since the p-value (0.0005) is less than < 0.05 we may still be violating our normality assumption.

Heteroscedasticity
plot(lasso_model, which = 3)

plot(lasso_model, which = 4)

bptest(lasso_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  lasso_model
## BP = 209.89, df = 7, p-value < 2.2e-16
# Breusch-Pagan test; look for high p-value

Our Scale-Location plot shows that points appear somewhat evenly distributed above and below the trend line but there is slight bowing in the trend line, suggesting that the variance of the residuals is not constant and therefore heteroschedastic. While there is no obvious fan/wedge pattern, there is clustering in the center suggesting underfitting, high leverage outliers or that additional transformation may be needed. As our Cook’s Distance plot has no values with a greater than 1, we can rule our the effects of high leverage points.

The Breusch-Page test statistic BP (209.89) and the small p-value (2.2e-16) suggest evidence of heteroscedasticity. Though the values are different that we may need to transform the data to meet the Assumption of Homoscedasticity.

Independence
acf(residuals(lasso_model))

durbinWatsonTest(lasso_model)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.01774321      2.029976   0.508
##  Alternative hypothesis: rho != 0
# Durbin Watson should be close to 2

Our Autocorrelation Function shows that there are lags above the blue dashed line, suggesting no autocorrelation. This is confirmed through a Durbin-Watson test statistic value of 2.03 and an autocorrelation value of -0.0177. Furthermore, as our p-value (0.504) is greater than 0.05, we do not have enough evidence to reject the null hypothesis that there is no autocorrelation. In other words, the test results suggest that our model’s residuals are independent and therefore do not violate the Independence Assumption.

Comparing All Models

Here we are comparing the all models using our training subset (75% of observations).

#stp_model_inv - is throwing errors due to difference in scales
compare_performance(base_model, high_impact_model, log_model, back_select_model, stp_model_sm, stp_logy_model, stp_logx_model, lasso_model, rank = TRUE)

The “compare_performance” function compares various test results including Adjusted R-Squared, AIC, and BIC. Based on these results, “compare_performance” suggests that our Log Transformed Dependent variable (stp_logy_model_) performed the best. However, this model had the most obvious violations of regression assumptions. Thus we will discard it moving forward. Re-running “compare_performance” without this model shows that

  • Note that we were unable to compare the performance as the scales differ our Step-wise model with transformed independent variables (stp_logx_model) performs best on the tests above.
#stp_model_inv - is throwing errors due to difference in scales
compare_performance(base_model, high_impact_model, log_model, back_select_model, stp_model_sm, stp_logx_model, lasso_model)

Conclusion:

If interpretability is most important → Use base Stats Model.

If statistical optimization is preferred → Use High-Impact Model.

If non-linearity is a concern → Use Log-Transformed Model.

If statistical significance is most important → Log-Transformed Step-wise Model

Model Selection

Testing the model

In this section, we will use the test dataframe to test the accuracy of our model’s predictions

# Do we need to clean eval dataset first?

# Stats Model predictions
predictedWins = predict(base_model, stp25_test_df)
stp25_test_df["PREDICTED_WINS"] = predictedWins

# Improved Model predictions
predictedWins = predict(improved_model, stp25_test_df)
stp25_test_df["PREDICTED_WINS_IMP"] = predictedWins

# High-Impact Model
predictedWins = predict(high_impact_model, stp25_test_df)
stp25_test_df["PREDICTED_WINS_HIM"] = predictedWins

# Log-Transformed Model
predictedWins = predict(log_model, train_df_log)
train_df_log["PREDICTED_WINS_LOG"] = predictedWins

# Log-Transformed Step-wise Model
predictedWins = predict(stp_logx_model, stplg25_test_df)
stplg25_test_df["PREDICTED_WINS_STP"] = predictedWins

# BoxCox-Transformed Step-wise Model
### TODO: Backward Tranform
#predictedWins = predict(stp_mode_inv, train_df_log)
#train_df_log["PREDICTED_WINS_INV"] = predictedWins
# Create a df to store our results
evaluation_metrics <- data.frame(
  model_name = character(0), 
  MAE = numeric(0),     
  RMSE = numeric(0),    
  RSquared = numeric(0)      
)
  
eval_predictions <- function(model, predict_col, target_col, model_name) {
  
  pred = lm(predict_col ~ target_col)
  plot(predict_col, target_col, xlab="Actual Wins", ylab="Predicted Wins")
  abline(pred)
  
  plot(model$residuals)
  
  model_metrics = data.frame(
    model_name = model_name,
    MAE = mae(target_col, predict_col),
    RMSE = rmse(target_col, predict_col),
    RSquared = (cor(target_col, predict_col)^2 )
  )
  
  evaluation_metrics <<- rbind(evaluation_metrics, model_metrics)
}

# Evaluate Stats Model predictions
eval_predictions(base_model, stp25_test_df$PREDICTED_WINS, stp25_test_df$TARGET_WINS, "Stats")

# Improved Model
eval_predictions(improved_model, stp25_test_df$PREDICTED_WINS_IMP, stp25_test_df$TARGET_WINS, "Improved")

# Evaluate High-Impact Model
eval_predictions(high_impact_model, stp25_test_df$PREDICTED_WINS_HIM, stp25_test_df$TARGET_WINS, "High-Impact")

# Evaluate Log-Transformed Model
eval_predictions(log_model, train_df_log$PREDICTED_WINS_LOG, train_df_log$TARGET_WINS, "Log-Tranformed")

# Evaluate Log-Transformed Step-wise Model
eval_predictions(stp_logx_model, stplg25_test_df$PREDICTED_WINS_STP, stplg25_test_df$TARGET_WINS, "LT Step-wise")

# TODO: backward transform
# Evaluate Box-Cox Transformed Step-wise Model
#eval_predictions(stp_model_inv, train_df_log$PREDICTED_WINS_INV, train_df_log$TARGET_WINS, "Box-Cox")

To evaluate the accuracy of each our model’s predictions, we will compare: - the Mean Absolute Error (MAE) - the Root Mean Squared Error (RMSE) - the R-squared (R²)

print(evaluation_metrics)
##       model_name      MAE     RMSE  RSquared
## 1          Stats 10.96006 14.00786 0.2234439
## 2       Improved 10.69973 13.76642 0.2507960
## 3    High-Impact 10.89616 13.95167 0.2292489
## 4 Log-Tranformed 10.81108 13.57413 0.2379521
## 5   LT Step-wise 10.12808 13.15547 0.3175279

A quick examination of these shows the Log-Transformed Step-wise Model has the lowest MAE and RMSE as well as the highest R-squared, suggesting that this model is somewhat more accurate than the other four evaluated. For confirmation we will perform a 5-fold cross-validation below:

library(caret)

train_control <- trainControl(method = "cv", number = 5)  # 10-fold cross-validation

model_cv <- train(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
    TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
    TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
    TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stplg75_train_df, method = "lm", trControl = train_control)

print(model_cv)
## Linear Regression 
## 
## 1704 samples
##   12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1363, 1364, 1363, 1363, 1363 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   12.90372  0.3150703  10.02132
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(stp_logx_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stplg75_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.956  -8.307  -0.153   7.949  61.107 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -41.094143  18.185201  -2.260   0.0240 *  
## TEAM_BATTING_H     0.038425   0.005839   6.581 6.23e-11 ***
## TEAM_BATTING_2B   -0.021815   0.010427  -2.092   0.0366 *  
## TEAM_BATTING_3B    0.103040   0.019926   5.171 2.60e-07 ***
## TEAM_BATTING_HR    0.050302   0.010691   4.705 2.74e-06 ***
## TEAM_BATTING_BB    0.071239   0.008986   7.928 4.01e-15 ***
## TEAM_BATTING_SO   -0.023366   0.005234  -4.464 8.56e-06 ***
## TEAM_BASERUN_SB    0.037557   0.005044   7.446 1.53e-13 ***
## TEAM_PITCHING_H   17.554470   4.308701   4.074 4.83e-05 ***
## TEAM_PITCHING_BB -28.690911   3.814242  -7.522 8.71e-14 ***
## TEAM_PITCHING_SO  16.554274   3.347389   4.945 8.35e-07 ***
## TEAM_FIELDING_E   -0.037062   0.003693 -10.036  < 2e-16 ***
## TEAM_FIELDING_DP  -0.095601   0.014610  -6.543 7.95e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.77 on 1691 degrees of freedom
## Multiple R-squared:  0.331,  Adjusted R-squared:  0.3263 
## F-statistic: 69.72 on 12 and 1691 DF,  p-value: < 2.2e-16

MAE and RMSE is even lower when using cross-validation. Although our R-squared value decreased from 0.3183 to 0.3034024, this value is still higher than the model with second highest R-squared value of 0.2368 for the Log-Tranformed model.

Conclusion

Our previous test suggest that the Log-Transformed Step-wise Model appears to be the most accurate at predicting wins over a team’s season. However, with 12 predictors, this model is fairly complex. Given the audience for our report and the need to clearly describe our model to the coach, we should select a simpler model that is still statistically significant and fairly accurate. Therefore, we will select our Improved model.

Final Analysis

The Improved Model is the best choice for predicting TARGET_WINS because it has the highest R² (27.89%), the lowest residual standard error (13.25), and avoids severe multicollinearity issues.

Unlike the High-Impact Model, which suffers from high VIF values (HR & Pitching HR > 20), and the Log Model, which has weaker predictive power (R² = 23.41%), the Improved Model balances performance, interpretability, and statistical significance.

Key predictors like hits, home runs, strikeouts, and fielding errors are logical, and the interaction between home runs and walks (HR * BB) is highly significant, confirming that plate discipline enhances home run effectiveness.

The only concern is TEAM_BATTING_2B (Doubles), which has an unexpected negative coefficient and needs further analysis.

Assumptions of Multiple Linear Regression:

Additionally, the Improved Model satisfies all key assumptions of multiple linear regression—it demonstrates linearity, independence of errors, normality of residuals, and no severe multicollinearity (all VIF values < 5).

While a simpler model is preferred, the slight increase in complexity is justified by better accuracy and logical relationships.

To finalize the model, we should re-run it without TEAM_BATTING_2B to see if it improves further, perform residual diagnostics, and validate with cross-validation. Given its strong balance of accuracy and interpretability, the Improved Model is the best option for predicting team wins.

Making Predictions Using the Evaluation Dataset:

Data Preparation:

  • Analyze Missing Values from Train Datasets.
missing_values <- eval_df %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count")

print(missing_values)
## # A tibble: 15 × 2
##    Variable         Missing_Count
##    <chr>                    <int>
##  1 TEAM_BATTING_H               0
##  2 TEAM_BATTING_2B              0
##  3 TEAM_BATTING_3B              0
##  4 TEAM_BATTING_HR              0
##  5 TEAM_BATTING_BB              0
##  6 TEAM_BATTING_SO             18
##  7 TEAM_BASERUN_SB             13
##  8 TEAM_BASERUN_CS             87
##  9 TEAM_BATTING_HBP           240
## 10 TEAM_PITCHING_H              0
## 11 TEAM_PITCHING_HR             0
## 12 TEAM_PITCHING_BB             0
## 13 TEAM_PITCHING_SO            18
## 14 TEAM_FIELDING_E              0
## 15 TEAM_FIELDING_DP            31

Removing Mostly NA and INDEX Column:

eval_df <- eval_df[, !names(eval_df) %in% "INDEX"]
eval_df <- eval_df[, !names(eval_df) %in% "TEAM_BATTING_HBP"]
eval_df

Mean/Median Imputation (For Numeric Data)

Instead of removing missing values, fill them with the mean or median:

eval_df$TEAM_BATTING_SO[is.na(eval_df$TEAM_BATTING_SO)] <- mean(eval_df$TEAM_BATTING_SO, na.rm = TRUE)
eval_df$TEAM_BASERUN_SB[is.na(eval_df$TEAM_BASERUN_SB)] <- mean(eval_df$TEAM_BASERUN_SB, na.rm = TRUE)
eval_df$TEAM_BASERUN_CS[is.na(eval_df$TEAM_BASERUN_CS)] <- mean(eval_df$TEAM_BASERUN_CS, na.rm = TRUE)
eval_df$TEAM_FIELDING_DP[is.na(eval_df$TEAM_FIELDING_DP)] <- mean(eval_df$TEAM_FIELDING_DP, na.rm = TRUE)
eval_df$TEAM_PITCHING_SO[is.na(eval_df$TEAM_PITCHING_SO)] <- mean(eval_df$TEAM_PITCHING_SO, na.rm = TRUE)

4. Verifying That Missing Values Are Fixed

sum(is.na(eval_df))  # Total missing values
## [1] 0
colSums(is.na(eval_df))  # Missing values per column
##   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##                0                0                0                0 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
##                0                0                0                0 
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
##                0                0                0                0 
##  TEAM_FIELDING_E TEAM_FIELDING_DP 
##                0                0
skim(eval_df)
Data summary
Name eval_df
Number of rows 259
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
TEAM_BATTING_H 0 1 1469.39 150.66 819 1387.0 1455.00 1548.0 2170 ▁▂▇▁▁
TEAM_BATTING_2B 0 1 241.32 49.52 44 210.0 239.00 278.5 376 ▁▂▇▇▂
TEAM_BATTING_3B 0 1 55.91 27.14 14 35.0 52.00 72.0 155 ▇▇▃▁▁
TEAM_BATTING_HR 0 1 95.63 56.33 0 44.5 101.00 135.5 242 ▆▅▇▃▁
TEAM_BATTING_BB 0 1 498.96 120.59 15 436.5 509.00 565.5 792 ▁▁▅▇▁
TEAM_BATTING_SO 0 1 709.34 234.48 0 565.0 709.34 904.5 1268 ▁▃▇▆▂
TEAM_BASERUN_SB 0 1 123.70 91.00 0 60.5 96.00 149.0 580 ▇▅▁▁▁
TEAM_BASERUN_CS 0 1 52.32 18.81 0 44.0 52.32 56.0 154 ▁▇▂▁▁
TEAM_PITCHING_H 0 1 1813.46 1662.91 1155 1426.5 1515.00 1681.0 22768 ▇▁▁▁▁
TEAM_PITCHING_HR 0 1 102.15 57.65 0 52.0 104.00 142.5 336 ▇▇▆▁▁
TEAM_PITCHING_BB 0 1 552.42 172.95 136 471.0 526.00 606.5 2008 ▆▇▁▁▁
TEAM_PITCHING_SO 0 1 799.67 611.78 0 622.5 782.00 927.5 9963 ▇▁▁▁▁
TEAM_FIELDING_E 0 1 249.75 230.90 73 131.0 163.00 252.0 1568 ▇▁▁▁▁
TEAM_FIELDING_DP 0 1 146.06 24.28 69 134.5 146.06 160.5 204 ▁▂▇▆▁

Make Predictions Using the Improved_Model:

# Generate predictions using the trained Improved Model
eval_df$PREDICTED_WINS <- predict(improved_model, newdata = eval_df)

# View a sample of predictions
head(eval_df$PREDICTED_WINS)
## [1] 69.20074 71.69295 79.61681 81.88350 78.60632 74.74956

Interpretation of Predicted TARGET_WINS Values

The predicted values represent the expected number of wins for teams based on their batting, pitching, and fielding statistics using the Improved Model.

For example:

A team with a predicted 69.15 wins is expected to win around 69 games in a full season. A team with a predicted 82.52 wins is expected to perform better, likely finishing with around 82-83 wins. The variation in predictions suggests that some teams are stronger than others based on key performance metrics (hits, home runs, walks, strikeouts, fielding errors, etc.).

Evaluating the Accuracy of Predictions Against Actual TARGET_WINS

Now that we have predicted TARGET_WINS, we need to compare these predictions with the actual values in the TARGET_WINS column to assess model accuracy.

Step 1: Compute Error Metrics

# Check structure of the dataset
str(eval_df)
## 'data.frame':    259 obs. of  15 variables:
##  $ TEAM_BATTING_H  : int  1209 1221 1395 1539 1445 1431 1430 1385 1259 1397 ...
##  $ TEAM_BATTING_2B : int  170 151 183 309 203 236 219 158 177 212 ...
##  $ TEAM_BATTING_3B : int  33 29 29 29 68 53 55 42 78 42 ...
##  $ TEAM_BATTING_HR : int  83 88 93 159 5 10 37 33 23 58 ...
##  $ TEAM_BATTING_BB : int  447 516 509 486 95 215 568 356 466 452 ...
##  $ TEAM_BATTING_SO : num  1080 929 816 914 416 377 527 609 689 584 ...
##  $ TEAM_BASERUN_SB : num  62 54 59 148 124 ...
##  $ TEAM_BASERUN_CS : num  50 39 47 57 52.3 ...
##  $ TEAM_PITCHING_H : int  1209 1221 1395 1539 3902 2793 1544 1626 1342 1489 ...
##  $ TEAM_PITCHING_HR: int  83 88 93 159 14 20 40 39 25 62 ...
##  $ TEAM_PITCHING_BB: int  447 516 509 486 257 420 613 418 497 482 ...
##  $ TEAM_PITCHING_SO: num  1080 929 816 914 1123 ...
##  $ TEAM_FIELDING_E : int  140 135 156 124 616 572 490 328 226 184 ...
##  $ TEAM_FIELDING_DP: num  156 164 153 154 130 ...
##  $ PREDICTED_WINS  : num  69.2 71.7 79.6 81.9 78.6 ...
# Check if ACTUAL_WINS and PREDICTED_WINS are numeric
is.numeric(eval_df$ACTUAL_WINS)
## [1] FALSE
is.numeric(eval_df$PREDICTED_WINS)
## [1] TRUE

Step 2: Add an INDEX Column to Each Dataset

# Add a row index to both datasets
train_df$INDEX <- seq_len(nrow(train_df))  
eval_df$INDEX <- seq_len(nrow(eval_df))

Step 3: Align and Merge the Datasets

# First, check if both datasets have a common identifier (like INDEX)
head(train_df$INDEX)
## [1] 1 2 3 4 5 6
head(eval_df$INDEX)
## [1] 1 2 3 4 5 6
# Merge train_df and eval_df using INDEX (if applicable)
merged_df <- merge(train_df[, c("INDEX", "TARGET_WINS")], 
                   eval_df[, c("INDEX", "PREDICTED_WINS")], 
                   by = "INDEX", all = FALSE)

# Now check if both columns have equal length
nrow(merged_df)  
## [1] 259

Ensure data consistency before calculating accuracy metrics. Convert to numeric to avoid errors in correlation or calculations.

Step 4: Compute Accuracy Metrics

# Compute MAE, RMSE, and R²
mae <- mae(merged_df$TARGET_WINS, merged_df$PREDICTED_WINS)  
rmse <- rmse(merged_df$TARGET_WINS, merged_df$PREDICTED_WINS)  
r_squared <- cor(merged_df$TARGET_WINS, merged_df$PREDICTED_WINS)^2  

# Print results
cat("Model Accuracy Metrics:\n")
## Model Accuracy Metrics:
cat("Mean Absolute Error (MAE):", round(mae, 2), "\n")
## Mean Absolute Error (MAE): 13.73
cat("Root Mean Squared Error (RMSE):", round(rmse, 2), "\n")
## Root Mean Squared Error (RMSE): 17.14
cat("R-Squared (R²):", round(r_squared, 4), "\n")
## R-Squared (R²): 0.0017

Interpretation of Model Accuracy Metrics

The evaluation results indicate that our Improved Model is not performing well in predicting TARGET_WINS.

The Mean Absolute Error (MAE) of 13.87 suggests that, on average, predictions deviate by about 14 wins per team, which is quite high.

Additionally, the Root Mean Squared Error (RMSE) of 17.38 implies that some predictions have even larger errors, highlighting potential inconsistencies or missing key predictors.

Most concerning is the R-Squared (R²) value of 0.001, which means that the model explains almost none of the variation in team wins—essentially, the predictions are no better than random guesses.

This suggests that the model may be overfitting the training data and failing to generalize, or that it lacks critical predictive features.

To improve performance, we should consider adding more relevant predictors (such as ERA or OBP), removing unimportant or noisy variables, and potentially using alternative mo

Applying Prediction to Our Evalution

With our models tested and evaluated, we can no apply our model to our final evaluation dataframe.

predictedWins = predict(stp_logx_model, eval_df)
eval_df["PREDICTED_WINS"] = predictedWins